Abstract: We consider the problem of decomposing the total
mutual information conveyed by a pair of predictor random variables
about a target random variable into redundant, unique and
synergistic contributions. We focus on the relationship between
“redundant information” and the more familiar information-theoretic
notions of “common information.” Our main contribution
is an impossibility result. We show that for independent
predictor random variables, any common information based measure
of redundancy cannot induce a nonnegative decomposition
of the total mutual information. Interestingly, this entails that any
reasonable measure of redundant information cannot be derived
by optimization over a single random variable.

Keywords: common and private information, synergy, redundancy,
information lattice, sufficient statistic, partial information
decomposition

I. Introduction 

A complex system consists of multiple interacting parts
or subsystems. A prominent example is the human brain, which
exhibits structure spanning a hierarchy of multiple spatial and
temporal scales [1]. A series of recent papers have focused on
the problem of information decomposition in complex systems
[?]. A simple version of the problem can be stated as
follows: The total mutual information that a pair of predictor
random variables (RVs) $(X_1, X_2)$ convey about a target RV $Y$
can have aspects of synergistic information (conveyed only by
the joint RV $(X_1 X_2)$), of redundant information (identically
conveyed by both $X_1$ and $X_2$), and of unique or private
information (exclusively conveyed by either $X_1$ or $X_2$). Is
there a principled information-theoretic way of decomposing
the total mutual information $I(X_1 X_2; Y)$ into nonnegative
quantities?

Developing a principled approach to disentangling synergy
and redundancy has been a long-standing pursuit in
neuroscience and allied fields* [1], [?]-[54]. However, the
traditional apparatus of Shannon’s information theory does not
furnish ready-made tools for quantifying multivariate interactions.
Starting with the work of Williams and Beer [2], several
workers have begun addressing these issues [?]. For the
general case of $K$ predictors, Williams and Beer [2] proposed
the partial information (PI) decomposition framework to specify
how the total mutual information about the target is shared
across the singleton predictors and their overlapping or disjoint
coalitions. Effecting a nonnegative decomposition has, however,
turned out to be a surprisingly difficult problem even for the
modest case of $K = 3$ [?]. Furthermore, there seems to be
no clear consensus as to what is an ideal measure of redundant
information.

*We invite the interested reader to see Appendix B, where we provide a 
sampling of several interesting examples and applications, where information- 
theoretic notions of synergy and redundancy are deemed useful. 
We focus on the relationship between redundant information
and the more familiar information-theoretic notions of
common information [18], [19]. We distinguish synergistic and
redundant interactions that exist within a group of predictor
RVs from those that exist between a group of predictor RVs
and a target RV. A popular measure of the former (symmetric)
type of interaction is the co-information [43]. Our main interest,
however, lies in asymmetric measures of interaction that
distinguish the target RV from the group of predictor RVs. An
instance of such an interaction is when populations of retinal
ganglion cells (predictors) interact to encode a (target) visual
stimulus [62]. Yet another instance is when multiple genes
(predictors) cooperatively interact within cellular pathways
to specify a (target) phenotype [54]. In building up to our
main contribution, we review and extend existing (symmetric)
measures of common information to capture the asymmetric
nature of these interactions.

Section organization and summary of results. In Section
II, building on the heuristic notion of embodying information
using $\sigma$-algebras and sample space partitions, we formalize
the notions of common and private information structures.
Information in the technical sense of entropy hardly captures
the structure of information embodied in a source. First
introduced by Shannon in a lesser known, short note [26],
information structures capture the quintessence of “information
itself.” We bridge several inter-related domains (notably game
theory, distributed control, and team decision problems) to
investigate the properties of such structures. Surprisingly, while
the ideas are not new, we are not aware of any prior work or
exposition where common and private information structures
have received a unified treatment. For instance, the notion of
common information structures has appeared independently
in at least four different early works, namely, those of Shannon
[26], Gács and Körner [18], Aumann [?], and Hexner and
Ho [29], and more recently in [18], [27], [46]. In the first
part of (mostly expository) Section II, we make some of these
connections explicit for a finite alphabet.

In the second part of Section ll, we take a closer look at 
the intricate relationships between a pair of RVs. Inspired by
the notion of private information structures [23], we derive
a measure of private information and show how a dual of
that measure recovers a known result [?] in the form of the
minimal sufficient statistic for one variable with respect to 
the other. We also introduce two new measures of common 
information. The richness of the decomposition problem is 
already manifest in simple examples when common and private 
informational parts cannot be isolated. 

In Section III, we inquire if a nonnegative PI decomposition
of $I(X_1 X_2; Y)$ can be achieved using a measure of redundancy
based on the notions of common information due to Gács and
Körner [18] and Wyner [19]. We answer this question in the
negative. For independent predictor RVs, when any nonvanishing
redundancy can be attributed solely to a mechanistic
dependence between the target and the predictors, we show
that any common information based measure of redundancy
cannot induce a nonnegative PI decomposition.

II. Information Decomposition into Common and
Private Parts: The Case for Two Variables

Let $(\Omega, \mathfrak{F}, P)$ be a fixed probability triple, where $\Omega$ is the
set of all possible outcomes, elements of the $\sigma$-algebra $\mathfrak{F}$ are
events and $P$ is a function returning an event’s probability. A
random variable (RV) $X$ taking values in a discrete measurable
space $(\mathcal{X}, \mathfrak{X})$ (called the alphabet) is a measurable function
$X : \Omega \to \mathcal{X}$ such that if $x \in \mathfrak{X}$, then
$X^{-1}(x) = \{\omega : X(\omega) \in x\} \in \mathfrak{F}$.
The $\sigma$-algebra induced by $X$ is denoted by $\sigma(X)$. We
use “iff” as a shorthand for “if and only if”.

A. Information Structure Aspects 

The heuristic notion of embodying information using
$\sigma$-algebras is not new [?], [47]. A sense in which $\sigma(X)$
represents information is given by the following lemma (see
Lemma 1.13 in [?]).

Lemma 1 (Doob-Dynkin Lemma). Let $X_1 : \Omega \to \mathcal{X}_1$ and
$X_2 : \Omega \to \mathcal{X}_2$ be two RVs, where $(\mathcal{X}_2, \mathfrak{X}_2)$ is a standard
Borel space. Then $X_2$ is $\sigma(X_1)$-measurable, or equivalently
$\sigma(X_2) \subseteq \sigma(X_1)$, iff there exists a measurable mapping
$f : \mathcal{X}_1 \to \mathcal{X}_2$ such that $X_2 = f(X_1)$.

Suppose an agent does not know the “true” point $\omega \in \Omega$
but only observes an outcome $X_1(\omega)$. If for each drawn $\omega$,
he takes some decision $X_2(\omega)$, then clearly $X_1(\omega)$ determines
$X_2(\omega)$ so that we necessarily have $X_2 = f(X_1)$. The Doob-Dynkin
lemma says that this is equivalent to $X_2$ being $\sigma(X_1)$-measurable
under some reasonable assumptions on the underlying
measurable spaces.

From Lemma 1, it is easy to see that $X_1$ and $X_2$ carry
the “same information” iff $\sigma(X_1) = \sigma(X_2)$. This notion
of informational sameness (denoted $X_1 \cong X_2$) induces a
partition on the set of all RVs into equivalence classes called
information elements. We say that the RV $X_S$ is representative
of the information element $S$. First introduced by Shannon in a
(perhaps) lesser known, short note [26], information elements
capture the quintessence of information itself in that all RVs
within a given class can be derived from a representative RV
for that class using finite state reversible encoding operations,
i.e., with 1-to-1 mappings. Contrast the notion of information
elements with the Shannon entropy of a source $X$, denoted
$H(X)$. Two sources $X_1$ and $X_2$ might produce information
at the same entropy rate, but not necessarily produce the
“same” information. Thus,
$X_1 = X_2 \implies X_1 \cong X_2 \implies H(X_1) = H(X_2)$,
but the converse of neither implication is true.

²See, though, Example 4.10 in [?] for a counterexample.

³Most of the arguments here are valid for a countable $\mathcal{X}$. Entropies for
countable alphabets can be infinite and even discontinuous. In the later
sections, we shall be dealing solely with finite discrete RVs.

⁴For finite or countable $\mathcal{X}_1$, $\mathcal{X}_2$, if $f : \mathcal{X}_1 \to \mathcal{X}_2$ is a bijection such
that $X_2 = f(X_1)$, then $H(X_2) = H(X_1)$, i.e., entropy is invariant under
relabeling.

A partial order between two information elements $S_1$ and
$S_2$ is defined as follows: $S_1 \succeq S_2$ iff $H(S_2|S_1) = 0$ or
equivalently iff $X_{S_2}$ is $\sigma(X_{S_1})$-measurable. We say that $S_1$
is larger than $S_2$ or equivalently $S_2$ is an abstraction of $S_1$.
Likewise, we write $S_1 \preceq S_2$ if $S_2 \succeq S_1$, when $S_1$ is smaller
than $S_2$. There exists a natural metric $\rho$ on the space of information
elements and an associated topology induced by $\rho$ [26].
$\rho$ is defined as follows: $\rho(S_1, S_2) = H(S_1|S_2) + H(S_2|S_1)$.
Clearly, $\rho(S_1, S_2) = 0$ iff $S_1 \succeq S_2$ and $S_2 \succeq S_1$. The join of
two information elements $S_1$ and $S_2$ is given by $\sup\{S_1, S_2\}$
(denoted $S_1 \vee S_2$) and is called the joint information of both
$S_1$ and $S_2$. The joint RV $(X_{S_1}, X_{S_2})$ is representative of the
joint information. Likewise, the meet is given by $\inf\{S_1, S_2\}$
(denoted $S_1 \wedge S_2$) and is called the common information of
$S_1$ and $S_2$. $(X_{S_1} \wedge X_{S_2})$ is the representative common RV
[26]. The entropies of both the joint and common information
elements are invariant within a given equivalence class.
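The metric $\rho$ can be evaluated directly on a finite sample space. The following is a minimal sketch, assuming a hypothetical four-point sample space with a uniform measure; the helper names (`push_forward`, `cond_entropy`, `rho`) are illustrative and do not come from the paper. It shows that a bijective relabeling is at distance zero, whereas two RVs with equal entropy can be far apart.

```python
from collections import defaultdict
from math import log2

def entropy(pmf):
    """Shannon entropy in bits of a dict {value: probability}."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

def push_forward(p_omega, rv):
    """Distribution of the RV rv (a dict omega -> value) under p_omega."""
    out = defaultdict(float)
    for w, p in p_omega.items():
        out[rv[w]] += p
    return dict(out)

def cond_entropy(p_omega, a, b):
    """H(A|B) = H(A,B) - H(B) for RVs a, b given as dicts omega -> value."""
    joint = {w: (a[w], b[w]) for w in p_omega}
    return entropy(push_forward(p_omega, joint)) - entropy(push_forward(p_omega, b))

def rho(p_omega, a, b):
    """rho(A, B) = H(A|B) + H(B|A)."""
    return cond_entropy(p_omega, a, b) + cond_entropy(p_omega, b, a)

# Assumed toy sample space: four equiprobable outcomes.
p = {w: 0.25 for w in range(4)}
x1 = {0: 0, 1: 0, 2: 1, 3: 1}                   # first bit
x1_relabel = {0: "b", 1: "b", 2: "a", 3: "a"}   # bijective relabeling of x1
x2 = {0: 0, 1: 1, 2: 0, 3: 1}                   # second bit, same entropy as x1

d_same = rho(p, x1, x1_relabel)   # same information element
d_diff = rho(p, x1, x2)           # equal entropies, different information
```

Here `d_same` vanishes because the two RVs generate the same partition, while `d_diff` equals $H(X_1) + H(X_2)$ because the two bits are independent.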

A finite set of information elements endowed with the
partial order $\succeq$, join ($\vee$), and meet ($\wedge$) operations has the
structure of a metric lattice which is isomorphic to a finite
partition lattice [?]. As a simple example, the lattice
structure arising out of a XOR operation is the diamond
lattice $M_3$, the smallest instance of a nondistributive modular
lattice. The nondistributivity is easily seen as follows: let
$S_3 = \mathrm{XOR}(S_1, S_2)$, where $S_1$ and $S_2$ are independent information
elements. In this example, $(S_3 \wedge S_2) \vee (S_3 \wedge S_1) = 0$,
whereas $S_3 \wedge (S_2 \vee S_1) = S_3 \neq 0$. In general, however,
information lattices are neither distributive nor modular [26],
[?]. More important for our immediate purposes is the notion
of common information as defined by Shannon [26], which
arises naturally when quantifying information embodied in
structure. Contrast this with Shannon’s mutual information,
which does not correspond to any element in the information
lattice.
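The XOR example can be checked mechanically by representing each information element by the partition it induces on a sample space. The sketch below assumes the four-point space $\{0,1\}^2$ (two independent bits); `join`, `meet` and `partition_of` are illustrative helper names, with the join taken as the coarsest common refinement and the meet as the finest common coarsening.

```python
from itertools import product

def join(p, q):
    """Coarsest common refinement: nonempty pairwise intersections of blocks."""
    return {a & b for a in p for b in q if a & b}

def meet(p, q):
    """Finest common coarsening: merge blocks that overlap, transitively."""
    merged = []
    for block in list(p) + list(q):
        block = set(block)
        for m in [m for m in merged if m & block]:
            block |= m
            merged.remove(m)
        merged.append(block)
    return {frozenset(m) for m in merged}

def partition_of(omega, f):
    """Information partition generated on omega by the map f."""
    cells = {}
    for w in omega:
        cells.setdefault(f(w), set()).add(w)
    return {frozenset(c) for c in cells.values()}

omega = list(product([0, 1], repeat=2))
s1 = partition_of(omega, lambda w: w[0])          # first bit
s2 = partition_of(omega, lambda w: w[1])          # second bit
s3 = partition_of(omega, lambda w: w[0] ^ w[1])   # XOR of the two bits
bottom = {frozenset(omega)}                       # trivial element, "0" in the text

lhs = join(meet(s3, s2), meet(s3, s1))   # (S3 ^ S2) v (S3 ^ S1)
rhs = meet(s3, join(s2, s1))             # S3 ^ (S2 v S1)
```

With these definitions `lhs` is the trivial partition while `rhs` is the partition of $S_3$ itself, so the distributive law fails.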

The modeling of information structures can also be motivated
nonstochastically, i.e., when the underlying space has
no probability measure associated with it (e.g., see [?]).
Let $(\Omega, \mathfrak{F})$ be a measurable space, where $\Omega$ is the
set of possible states of Nature, and elements of $\mathfrak{F}$ are events.
One of the states $\omega \in \Omega$ is the “true” state. An event $E$ occurs
when $\omega \in E$. Define an uncertain variable $X$ [46] taking
values in a discrete measurable space $(\mathcal{X}, \mathfrak{X})$ as the measurable
function $X : \Omega \to \mathcal{X}$, where $\mathfrak{X}$ contains all singletons. The
$\sigma$-algebra induced by $X$ is $\sigma(X) = \sigma(\{X^{-1}(T) : T \in \mathfrak{X}\})$.
$X$ generates a partition on $\Omega$ called the information partition
$\mathcal{P}_X = \{X^{-1}(x) \subseteq \Omega : x \in \mathcal{X}\}$. Since the alphabet $\mathcal{X}$ is finite
or countable, $\sigma(\mathcal{P}_X) = \sigma(X)$.

The information structure $(\Omega, \mathcal{P}_X)$ specifies the extent to
which an agent observing $X$ can distinguish among different
states of Nature. Given an observation $x = X(\omega)$, an agent
endowed with a partition $\mathcal{P}_X$ only knows that the true state
belongs to $\mathcal{P}_X(\omega)$, where $\mathcal{P}_X(\omega)$ is the element of $X$’s partition
that contains $\omega$. Given a pair of partitions $(\mathcal{P}_i, \mathcal{P}_j)$ on $\Omega$,
$\mathcal{P}_i$ is said to be finer than $\mathcal{P}_j$, and $\mathcal{P}_j$ coarser than $\mathcal{P}_i$, if
$\mathcal{P}_i(\omega) \subseteq \mathcal{P}_j(\omega)$ $\forall \omega \in \Omega$. If $\mathcal{P}_i$ is finer than $\mathcal{P}_j$, then agent $i$ has
more precise information than agent $j$ in that $i$ can distinguish
between more states of Nature. We say $X$ knows an event $E$
at $\omega$ if $\mathcal{P}_X(\omega) \subseteq E$. $E$ can only be known if it occurs. The
event that $X$ knows $E$ is the set $K_X(E) = \{\omega : \mathcal{P}_X(\omega) \subseteq E\}$.
Then, given two agents, Alice observing $X$ and Bob observing
$Y$, $K_X(E) \cap K_Y(E)$ is the event that $E$ is mutually known
(between Alice and Bob). We say that an event $E$ is commonly
known (to both Alice and Bob) if it occurs, or equivalently,
an event $E$ is common information iff $E \in \sigma(\mathcal{P}_*)$, where
$\mathcal{P}_* = \mathcal{P}_X \wedge \mathcal{P}_Y$ is the finest common coarsening of the agents’
partitions⁵. Since the $\sigma$-algebra generated by $\mathcal{P}_*$ is simply
$\sigma(\mathcal{P}_X) \cap \sigma(\mathcal{P}_Y)$, or equivalently, $\sigma(X) \cap \sigma(Y)$, $E$ is common
information iff $E \in \sigma(X) \cap \sigma(Y)$. Commonly knowing
$E$ is a far stronger requirement than mutually knowing $E$.
For finite $\mathcal{X}$, $\mathcal{Y}$, the common information structure admits a
representation as a graph $C_{XY}$ with the vertex set $\mathcal{P}_X \vee \mathcal{P}_Y$
and an edge connecting two vertices if the corresponding atoms
$v_i$ and $v_j$ are contained in a single atom of $\mathcal{P}_X$ or $\mathcal{P}_Y$ or of
both. The connected components of $C_{XY}$ are in one-to-one
correspondence with the atoms of $\mathcal{P}_*$ [?].

Example 1. Let $\Omega = \{\omega_1, \omega_2, \omega_3, \omega_4\}$. Alice observes $X$,
which generates the information partition $\mathcal{P}_X = \omega_1\omega_4|\omega_2|\omega_3$.
Likewise, Bob observes $Y$, which induces the partition
$\mathcal{P}_Y = \omega_1\omega_2|\omega_3|\omega_4$. Let $\omega_2$ be the true state of Nature. Consider the
event $E = \{\omega_1, \omega_2\}$. Both Alice and Bob know $E$ at $\omega_2$,
since $\mathcal{P}_X(\omega_2) = \{\omega_2\} \subseteq E$ and $\mathcal{P}_Y(\omega_2) = \{\omega_1, \omega_2\} \subseteq E$.
The event that Alice knows $E$ is simply the true state $\{\omega_2\}$
(i.e., $K_X(E) = \{\omega_2\}$), whereas for Bob, $K_Y(E) = \{\omega_1, \omega_2\}$.
Clearly, Bob cannot tell apart the true state $\{\omega_2\}$ (in which
Alice knows $E$) from $\{\omega_1\}$ (in which Alice does not know $E$).
Hence, $E$ is not commonly known to Alice and Bob.

On the other hand, it is easy to check that the events
$\{\omega_1, \omega_2, \omega_4\}$ and $\{\omega_3\}$ are common information. Indeed,
$\mathcal{P}_* = \mathcal{P}_X \wedge \mathcal{P}_Y = \{\{\omega_1, \omega_2, \omega_4\}, \{\omega_3\}\}$. $C_{XY}$ has the vertex set
$\mathcal{P}_X \vee \mathcal{P}_Y = \{\{\omega_1\}, \{\omega_2\}, \{\omega_3\}, \{\omega_4\}\}$ and the connected
components of $C_{XY}$ correspond to the atoms $\{\omega_1, \omega_2, \omega_4\}$ and
$\{\omega_3\}$ of $\mathcal{P}_*$.
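Example 1 is small enough to verify by direct computation. The sketch below encodes the states $\omega_1, \ldots, \omega_4$ as the integers 1 to 4 (an encoding chosen here for illustration) and implements the knowledge operator $K(E)$ and the finest common coarsening; the function names are hypothetical.

```python
def knows(partition, event):
    """K(E): the states at which an agent with this partition knows E."""
    return {w for block in partition for w in block if block <= event}

def meet(p, q):
    """Finest common coarsening: merge blocks that overlap, transitively."""
    merged = []
    for block in list(p) + list(q):
        block = set(block)
        for m in [m for m in merged if m & block]:
            block |= m
            merged.remove(m)
        merged.append(block)
    return {frozenset(m) for m in merged}

# Partitions of Example 1, with omega_i encoded as the integer i.
P_X = [frozenset({1, 4}), frozenset({2}), frozenset({3})]   # Alice
P_Y = [frozenset({1, 2}), frozenset({3}), frozenset({4})]   # Bob
E = {1, 2}

K_X = knows(P_X, E)                  # states where Alice knows E
K_Y = knows(P_Y, E)                  # states where Bob knows E
mutual = K_X & K_Y                   # E is mutually known here
P_star = meet(P_X, P_Y)              # common information partition
common = knows(P_star, E)            # C(E): where E is common knowledge
common_big = knows(P_star, {1, 2, 4})
```

The event $E$ is mutually known at $\omega_2$ but is common knowledge nowhere, whereas $\{\omega_1, \omega_2, \omega_4\}$ satisfies $C(E) = E$.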

One may also seek to characterize the private information
structures of the agents. Let $\Omega$ be a finite set of states of
Nature. To simplify notation, let $X$ denote the agent $X$ as well
as its information partition. Let Alice and Bob be endowed,
respectively, with information partitions $X$ and $Y$ so that $X$
and $Y$ are subalgebras of a $2^{|\Omega|}$-element Boolean algebra.
One plausible definition of the private information structure
of $Y$ is the minimal amount of information that $X$ needs from
$Y$ to reconstruct the joint information $Y \vee X$ [28]. Define
$PI_X(Y) := \{Z : Z \vee X = Y \vee X;\ Z \subseteq Y;\ Z \text{ minimal}\}$. Since
$PI_X(Y)$ complements $X$ to reconstruct $Y \vee X$, minimality of
$Z$ entails that $\forall Z \in PI_X(Y)$, $Z \wedge X = 0$, where $0$ denotes
the two-element algebra. Witsenhausen [?] showed that the
problem of constructing elements of $PI_X(Y)$ with minimal
cardinality is equivalent to the chromatic number problem for
a graph $G_Y$ with the vertex set $Y$ and an edge connecting
vertices $v_i$ and $v_j$ iff there exists an atom $x \in X$ such that
$v_i \cap x \neq \emptyset$ and $v_j \cap x \neq \emptyset$. Unfortunately, since there are
multiple valid minimal colorings of $G_Y$, $PI_X(Y)$ is not
unique. The following example illustrates the point.

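⁵The astute reader will immediately notice the connection with the notion
of common knowledge due to Aumann [?]. In keeping with our focus on
information structure, we prefer the term “common information” to “common
knowledge.” Indeed, for finite or countably infinite information partitions,
common knowledge is defined on the basis of the information contained in
$\mathcal{P}_*$ as follows. An event $E$ is common knowledge at $\omega$ iff $\mathcal{P}_*(\omega) \subseteq E$, i.e.,
the event that $E$ is common knowledge is $C(E) = \{\omega : \mathcal{P}_*(\omega) \subseteq E\}$. For
any event $E$, $C(E) \subseteq E$. $E$ is common information if $C(E) = E$ [17], [47].

⁶For uncountable alphabets, see e.g., [?] for a more nuanced discussion
on representing information structures using $\sigma$-algebras of events instead of
partitions.

Example 2. Consider the set $\Omega = \{\omega_1, \ldots, \omega_{16}\}$. Let Alice’s and
Bob’s partitions be, respectively,
$X = \omega_1\omega_3|\omega_4\omega_5|\omega_6\omega_7|\omega_8\omega_9|\omega_{10}\omega_2|\omega_{11}\omega_{13}|\omega_{14}\omega_{15}|\omega_{12}\omega_{16}$ and
$Y = \omega_1\omega_2|\omega_3\omega_4|\omega_5\omega_6|\omega_7\omega_8|\omega_9\omega_{10}|\omega_{11}\omega_{12}|\omega_{13}\omega_{14}|\omega_{15}\omega_{16}$.
$G_Y = (Y, \mathcal{E})$ has the edge set
$\mathcal{E} = \{\{\omega_1\omega_2, \omega_3\omega_4\}, \{\omega_3\omega_4, \omega_5\omega_6\}, \{\omega_5\omega_6, \omega_7\omega_8\}, \{\omega_7\omega_8, \omega_9\omega_{10}\},$
$\{\omega_9\omega_{10}, \omega_1\omega_2\}, \{\omega_{11}\omega_{12}, \omega_{13}\omega_{14}\}, \{\omega_{13}\omega_{14}, \omega_{15}\omega_{16}\}, \{\omega_{15}\omega_{16}, \omega_{11}\omega_{12}\}\}$.

Two distinct minimal colorings of $G_Y$ are as follows:

(a) $\gamma_1 = \{\omega_1\omega_2, \omega_7\omega_8, \omega_{11}\omega_{12}\}$, $\gamma_2 = \{\omega_3\omega_4, \omega_9\omega_{10}, \omega_{15}\omega_{16}\}$,
$\gamma_3 = \{\omega_5\omega_6, \omega_{13}\omega_{14}\}$,
so that

$PI_X^a(Y) = \omega_1\omega_2\omega_7\omega_8\omega_{11}\omega_{12}|\omega_3\omega_4\omega_9\omega_{10}\omega_{15}\omega_{16}|\omega_5\omega_6\omega_{13}\omega_{14}$,

and

(b) $\gamma'_1 = \{\omega_1\omega_2, \omega_5\omega_6, \omega_{11}\omega_{12}\}$, $\gamma'_2 = \{\omega_3\omega_4, \omega_7\omega_8, \omega_{13}\omega_{14}\}$,
$\gamma'_3 = \{\omega_9\omega_{10}, \omega_{15}\omega_{16}\}$,
so that

$PI_X^b(Y) = \omega_1\omega_2\omega_5\omega_6\omega_{11}\omega_{12}|\omega_3\omega_4\omega_7\omega_8\omega_{13}\omega_{14}|\omega_9\omega_{10}\omega_{15}\omega_{16}$.

It is easy to see that $PI_X^a(Y) \vee X = PI_X^b(Y) \vee X = Y \vee X$.
Hence, such a minimal coloring is not unique and
consequently, $PI_X(Y)$ is not unique.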
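The graph-coloring construction in Example 2 can be reproduced programmatically. This is a sketch under the encoding assumption that $\omega_i$ is the integer $i$; the helpers `blocks`, `is_proper`, `to_partition` and `chromatic_number` are illustrative, and the chromatic number is found by brute force, which is only sensible for a graph this small.

```python
from itertools import combinations, product

def blocks(spec):
    """Parse '1 3|4 5' into a list of frozensets of state indices."""
    return [frozenset(int(t) for t in part.split()) for part in spec.split("|")]

X = blocks("1 3|4 5|6 7|8 9|10 2|11 13|14 15|12 16")
Y = blocks("1 2|3 4|5 6|7 8|9 10|11 12|13 14|15 16")

# Two atoms of Y are adjacent in G_Y iff some atom of X meets both of them.
edges = {frozenset((u, v)) for u, v in combinations(Y, 2)
         if any(x & u and x & v for x in X)}

def is_proper(colouring):
    """True iff no edge of G_Y has both endpoints in one colour class."""
    colour = {atom: i for i, cls in enumerate(colouring) for atom in cls}
    return all(colour[u] != colour[v] for u, v in map(tuple, edges))

def to_partition(colouring):
    """Merge the atoms of Y inside each colour class into a single block."""
    return [frozenset().union(*cls) for cls in colouring]

def join(p, q):
    """Coarsest common refinement of two partitions."""
    return {a & b for a in p for b in q if a & b}

def chromatic_number(vertices):
    """Smallest number of colours admitting a proper colouring (brute force)."""
    for k in range(1, len(vertices) + 1):
        for colours in product(range(k), repeat=len(vertices)):
            colour = dict(zip(vertices, colours))
            if all(colour[u] != colour[v] for u, v in map(tuple, edges)):
                return k

col_a = [[Y[0], Y[3], Y[5]], [Y[1], Y[4], Y[7]], [Y[2], Y[6]]]   # colouring (a)
col_b = [[Y[0], Y[2], Y[5]], [Y[1], Y[3], Y[6]], [Y[4], Y[7]]]   # colouring (b)
chi = chromatic_number(Y)
```

Both colourings use `chi` colours, and joining either induced three-block partition with $X$ recovers $Y \vee X$, which is how the nonuniqueness of $PI_X(Y)$ shows up computationally.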

One would also like to characterize the information contained
exclusively in either $X$ or $Y$. The private information
structure of $Y$ with respect to $X$ may be defined as the
amount of information one needs to reconstruct $Y$ from the
common information $X \wedge Y$. Define
$PI(Y \setminus X) := \{Z : Z \vee (X \wedge Y) = Y;\ Z \text{ minimal}\}$, where minimality of $Z$ entails
that $\forall Z \in PI(Y \setminus X)$, if there exists a $Z'$ such that $Z' \supset Z$
and $Z' \vee (X \wedge Y) = Y$, then $Z' \notin PI(Y \setminus X)$. We note that, if
$Z \in PI(Y \setminus X)$, then $Z \vee X = Y \vee X$ and $Z \wedge X = 0$. Hexner
and Ho [29] proposed this definition and showed that it does not
admit a unique specification for the private information of $Y$
with respect to $X$, as can be seen from the following example.

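Example 3. Consider the set $\Omega = \{\omega_1, \ldots, \omega_6\}$ and the
following partitions on $\Omega$: $X = \omega_1\omega_2|\omega_3|\omega_4\omega_5|\omega_6$ and
$Y = \omega_1|\omega_2\omega_3|\omega_4|\omega_5\omega_6$. Then we have $X \vee Y = \omega_1|\omega_2|\omega_3|\omega_4|\omega_5|\omega_6$
and $X \wedge Y = \omega_1\omega_2\omega_3|\omega_4\omega_5\omega_6$. It is easy to see that each of
the following subalgebras satisfies the definition, i.e., given
$Z_1 = \omega_1\omega_4|\omega_2\omega_3\omega_5\omega_6$ and $Z_2 = \omega_1\omega_5\omega_6|\omega_2\omega_3\omega_4$, we have
$Z_1 \vee (X \wedge Y) = Z_2 \vee (X \wedge Y) = Y$ and $Z_1 \vee X = Z_2 \vee X = Y \vee X$.
Hence, $PI(Y \setminus X)$ is not unique.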

Remark 1. We have the following observations. Note that
if $Z_1 \in PI(Y \setminus X)$, then $Z_1 \vee X = Y \vee X$. Thus, one can
find a $Z_2 \in PI_X(Y)$ such that $Z_2 \subseteq Z_1$. Choosing $Z_1$
minimal, it follows that the cardinality of the minimal algebras
of $PI(Y \setminus X)$ is lower bounded by the cardinality of the
minimal algebras of $PI_X(Y)$ or equivalently by the chromatic
number of $G_Y$. Thus, $X$ need not use all of $PI(Y \setminus X)$ to
reconstruct $Y \vee X$. Furthermore, it is known that the lattice
$L$ of subalgebras of a finite Boolean algebra is isomorphic to
a finite partition lattice [?]. Thus, in general, $L$ is not distributive,
nor even modular. Since both the structures $PI_X(Y)$
and $PI(Y \setminus X)$ consist of complements in $L$, nonmodularity
of $L$ implies the nonuniqueness of the private information
structures.

B. Operational Aspects 

We now turn to mainstream information-theoretic notions
of “common information” (CI). We introduce the remaining
notation. For a discrete, finite-valued RV $X$, $p_X(x) = P\{X = x\}$
denotes the probability mass function (pmf or distribution)
of $X$. We abbreviate $p_X(x)$ as $p(x)$ when there is no
ambiguity. For $\mathcal{X} = \{x_n, n = 1, \ldots, N\}$, the entropy $H(X)$
of $X$ can be written as $H(p_1, \ldots, p_N) = -\sum_{n=1}^{N} p_n \log p_n$, where
$p_n = P\{X = x_n\}$ and $\sum_n p_n = 1$. The Kullback-Leibler
(KL) divergence from $q_X$ to $p_X$ is defined as

$D(p\|q) := \sum_{x \in \mathcal{X}} p_X(x) \log \frac{p_X(x)}{q_X(x)}$.
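Both quantities are straightforward to evaluate for finite pmfs. The following is a minimal sketch using base-2 logarithms (so results are in bits, a choice made here for illustration), with the usual conventions $0 \log 0 = 0$ and $D(p\|q) = \infty$ when $p$ puts mass where $q$ has none.

```python
from math import log2

def entropy(p):
    """H(p_1, ..., p_N) in bits, with 0 log 0 taken as 0."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D(p||q) in bits; infinite if p puts mass where q has none."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return float("inf")
            total += pi * log2(pi / qi)
    return total

# Illustrative pmf on a four-letter alphabet, compared with the uniform pmf.
p = [0.5, 0.25, 0.125, 0.125]
u = [0.25] * 4
h_p = entropy(p)
d_pu = kl_divergence(p, u)
```

For a uniform reference pmf on $N$ letters the two are linked by $D(p\|u) = \log N - H(p)$, which the example values satisfy.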

$X - Y - Z$ denotes that $X$ is conditionally independent of
$Z$ given $Y$ (denoted $X \perp Z | Y$), or equivalently, $X, Y, Z$ form
a Markov chain satisfying

$p(x,y,z) = \frac{p(x,y)\,p(y,z)}{p(y)} = p(x|y)\,p(y,z)$, if $p(y) > 0$; else $0$.

Equivalently, $p(y)\,p(x,y,z) = p(x,y)\,p(y,z)$.
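The product form of the Markov condition avoids division by $p(y)$ and is convenient for a numerical check. The sketch below assumes a joint pmf supplied as a dictionary keyed by $(x, y, z)$; the function name and the two toy distributions are illustrative.

```python
from collections import defaultdict
from itertools import product

def is_markov_chain(p_xyz, tol=1e-12):
    """True iff p(y) p(x,y,z) == p(x,y) p(y,z) for every triple (x, y, z)."""
    p_y, p_xy, p_yz = defaultdict(float), defaultdict(float), defaultdict(float)
    for (x, y, z), p in p_xyz.items():
        p_y[y] += p
        p_xy[x, y] += p
        p_yz[y, z] += p
    xs = {x for x, _, _ in p_xyz}
    ys = {y for _, y, _ in p_xyz}
    zs = {z for _, _, z in p_xyz}
    return all(
        abs(p_y[y] * p_xyz.get((x, y, z), 0.0) - p_xy[x, y] * p_yz[y, z]) <= tol
        for x, y, z in product(xs, ys, zs)
    )

# Assumed toy cases: a noiseless chain X = Y = Z, and Y = X xor Z with
# X and Z independent fair bits (given Y, X determines Z).
chain = {(b, b, b): 0.5 for b in (0, 1)}
xor = {(x, x ^ z, z): 0.25 for x in (0, 1) for z in (0, 1)}
```

The first distribution satisfies $X - Y - Z$; the second does not, since conditioning on the XOR couples the two otherwise independent bits.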

Let $\{X_i, Y_i\}_{i=1}^{\infty}$ be i.i.d. copies of the pair $(X,Y) \sim p_{XY}$
on $\mathcal{X} \times \mathcal{Y}$. An information source generating such a (stationary)
sequence is called a two-component discrete, memoryless
source (2-DMS). Given $\epsilon > 0$, we say that $\hat{X}^n$ $\epsilon$-recovers
$X^n$ iff $P\{\hat{X}^n \neq X^n\} < \epsilon$.

To fix ideas, consider a “one-decoder” network for the
distributed compression of a 2-DMS [20]. The correlated
streams $\{X_i\}_{i=1}^{\infty}$ and $\{Y_i\}_{i=1}^{\infty}$ are encoded separately at rates
$R_x$ and $R_y$ and decoded jointly by combining the two streams
to $\epsilon$-recover $(X^n, Y^n)$. A remarkable consequence of the
Slepian-Wolf theorem [20] is that the (minimum) sum rate
of $R_x + R_y = H(X,Y)$ is achievable. This immediately
gives a coding-theoretic interpretation of Shannon’s mutual
information (MI) as the maximum descriptive savings in sum
rate by considering $(X,Y)$ jointly rather than separately, i.e.,

$I(X;Y) = H(X) + H(Y) - \min(R_x + R_y)$.

Thus, for the one-decoder network, MI appears to be a natural
measure of CI of two dependent RVs. However, other networks
yield different CI measures. Indeed, as pointed out in [23],
depending upon the number of encoders and decoders and the
network used for connecting them, several notions of CI can
be defined. We restrict ourselves to two dependent sources and
a “two-decoder” network, where two different notions of CI due
to Gács and Körner [18] and Wyner [19] are well known. Each
of these notions appears as the solution to an asymptotic formulation
of some distributed information processing task.

Given a sequence $(X^n, Y^n)$ generated by a 2-DMS
$(\mathcal{X} \times \mathcal{Y}, p_{XY})$, Gács and Körner (GK) [18] defined CI as the
maximum rate of common randomness (CR) that two nodes,
observing sequences $X^n$ and $Y^n$ separately, can extract without
any communication, i.e.,

$C_{GK}(X;Y) := \sup \frac{1}{n} H(f_1^n(X^n))$,

where the supremum is taken over all sequences of pairs
of deterministic mappings $(f_1^n, f_2^n)$ such that
$P\{f_1^n(X^n) \neq f_2^n(Y^n)\} \to 0$ as $n \to \infty$.

The zero pattern of $p_{XY}$ is specified by its characteristic
bipartite graph $B_{XY}$ with the vertex set $\mathcal{X} \cup \mathcal{Y}$ and an edge
connecting two vertices $x$ and $y$ if $p_{XY}(x,y) > 0$. If $B_{XY}$ is a single
connected component, we say that $p_{XY}$ is indecomposable.
An ergodic decomposition of $p_{XY}$ is defined by a unique
partition of the space $\mathcal{X} \times \mathcal{Y}$ into connected components
[?]. Given an ergodic decomposition of $p_{XY}$
such that $\mathcal{X} \times \mathcal{Y} = \bigcup_{q_*} \mathcal{X}_{q_*} \times \mathcal{Y}_{q_*}$, define the RV $Q_*$ as
$Q_* = q_* \iff X \in \mathcal{X}_{q_*} \iff Y \in \mathcal{Y}_{q_*}$. For any RV $Q$
such that $H(Q|X) = H(Q|Y) = 0$, we have $H(Q|Q_*) = 0$,
so that $Q_*$ has the maximum range among all $Q$ satisfying
$H(Q|X) = H(Q|Y) = 0$. In this sense, $Q_*$ is the maximal
common RV⁷ of $X$ and $Y$. Remarkably, GK showed that

$C_{GK}(X;Y) = H(Q_*)$. (1)

Thus, common GK codes cannot exploit any correlation
beyond deterministic interdependence of the sources.
$C_{GK}(X;Y)$ depends solely on the zero pattern of $p_{XY}$ and
is zero for all indecomposable distributions.
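Equation (1) turns the computation of $C_{GK}(X;Y)$ into a connected-components problem on the characteristic bipartite graph. The following sketch assumes a joint pmf given as a dictionary keyed by $(x, y)$ and uses a small union-find structure; the function name and the two test distributions are illustrative choices.

```python
from math import log2

def gk_common_information(p_xy):
    """C_GK(X;Y) = H(Q*) in bits, where Q* indexes the connected component
    of the characteristic bipartite graph that contains the pair (X, Y)."""
    parent = {}

    def find(v):
        # Union-find lookup with path halving; unseen vertices become roots.
        while parent.setdefault(v, v) != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for (x, y), p in p_xy.items():
        if p > 0:                       # edge between x and y
            parent[find(("x", x))] = find(("y", y))

    mass = {}                           # probability of each component
    for (x, y), p in p_xy.items():
        if p > 0:
            root = find(("x", x))
            mass[root] = mass.get(root, 0.0) + p
    return -sum(m * log2(m) for m in mass.values() if m > 0)

# Assumed examples: a block-diagonal pmf with two equiprobable components,
# and a doubly symmetric binary source (full support, hence indecomposable).
block = {(x, y): 0.125 for x in (0, 1) for y in (0, 1)}
block.update({(x, y): 0.125 for x in (2, 3) for y in (2, 3)})
dsbs = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

c_block = gk_common_information(block)
c_dsbs = gk_common_information(dsbs)
```

The block-diagonal source yields one bit of common information, whereas the binary source yields none even though its two components are strongly correlated.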

The following double markovity lemma (see proof in 
Appendix A) is useful. 

Lemma 2. A triple of RVs $(X, Y, Q)$ satisfies the double
Markov conditions

$X - Y - Q$, $Y - X - Q$ (2)

iff there exists a pmf $p_{Q'|XY}$ such that $H(Q'|X) = H(Q'|Y) = 0$
and $XY - Q' - Q$. Furthermore, (2) implies
$I(XY;Q) = H(Q')$ iff $H(Q'|Q) = 0$.

Remark 2. For all $X, Y$, we have $I(X;Y) = H(Q_*) + I(X;Y|Q_*)$.
We say that $p_{XY}$ is saturable if $I(X;Y|Q_*) = 0$.
Equivalently, $p_{XY}$ is saturable iff there exists a pmf $p_{Q|XY}$
such that $X - Q - Y$, $Q - X - Y$, $Q - Y - X$ (see Lemma A1
in Appendix A). We say that the triple $(X, Y, Q)$ has a pairwise
double Markov structure when the latter condition holds.

The following alternative characterizations of $C_{GK}(X;Y)$
follow from Lemma 2 [?]:

$C_{GK}(X;Y) = \max_{Q:\ Q-X-Y,\ Q-Y-X} I(XY;Q) = I(X;Y) - \min_{Q:\ Q-X-Y,\ Q-Y-X} I(X;Y|Q)$, (3)

where the cardinality of the alphabet $\mathcal{Q}$ is bounded as
$|\mathcal{Q}| \le |\mathcal{X}||\mathcal{Y}| + 2$.

Wyner [19] defined CI as the minimum rate of CR needed
to simulate a 2-DMS $(\mathcal{X} \times \mathcal{Y}, p_{XY})$ using local operations and
no communication. More precisely, given access to a common
uniform random string $Q_n \sim \mathrm{unif}[1 : 2^{nR}]$ and independent
noisy channels $p_{\hat{X}^n|Q_n}(\hat{x}^n|q)$ and $p_{\hat{Y}^n|Q_n}(\hat{y}^n|q)$ such
that $(\hat{X}^n, \hat{Y}^n)$ $\epsilon$-recovers $(X^n, Y^n)$, the Wyner CI, denoted
$C_W(X;Y)$, is the minimum cost (in terms of the number
of common random bits per symbol $R$) for the distributed
approximate simulation of $p_{XY}$. $C_W(X;Y)$ admits an elegant
single-letter characterization,

$C_W(X;Y) := \min_{Q:\ X-Q-Y} I(XY;Q) = I(X;Y) + \min_{Q:\ X-Q-Y} \left[ I(Y;Q|X) + I(X;Q|Y) \right]$, (4)

where again $|\mathcal{Q}| \le |\mathcal{X}||\mathcal{Y}| + 2$.

A related notion of common entropy, $G(X;Y)$, is useful for
characterizing a zero-error version of the Wyner CI [?]:

$G(X;Y) := \min_{Q:\ X-Q-Y} H(Q)$. (5)

⁷It is not hard to see the connection with Shannon’s notion of common
information introduced earlier in Section II.A. In particular, we have
$Q_* = X \wedge Y$. Gács and Körner independently proposed the notion of common
information two decades following Shannon’s work [?].

Gray and Wyner (GW) [?] devised a distributed lossless
source coding network for jointly encoding the 2-DMS into
a common part (at rate $R_c$) and two private parts (at rates
$R_x$ and $R_y$), and separately decoding each private part using
the common part as side information. The optimal rate region
$\mathfrak{R}_{GW}(X;Y)$ for this “two-decoder” network configuration is
given by

$\mathfrak{R}_{GW}(X;Y) = \{ (R_c, R_x, R_y) \in \mathbb{R}_+^3 : \exists\, p_{Q|XY} \in \mathcal{P}_{XY} \text{ s.t. } R_c \ge I(XY;Q),\ R_x \ge H(X|Q),\ R_y \ge H(Y|Q) \}$,

where $\mathcal{P}_{XY}$ is the set of all conditional pmfs $p_{Q|XY}$ s.t.
$|\mathcal{Q}| \le |\mathcal{X}||\mathcal{Y}| + 2$. A trivial lower bound to $\mathfrak{R}_{GW}(X;Y)$ follows from
basic information-theoretic considerations [22]:

$\mathfrak{R}_{GW}(X;Y) \subseteq \mathfrak{L}_{GW}(X;Y) = \{ (R_c, R_x, R_y) : R_c + R_x \ge H(X),\ R_c + R_y \ge H(Y),\ R_c + R_x + R_y \ge H(XY) \}$.

The different notions of CI can be viewed as extreme points
for the corresponding common rate $R_c$ in the two-decoder
network⁸, i.e., for $(R_x, R_y, R_c) \in \mathfrak{R}_{GW}(X;Y)$, we have

$C_{GK}(X;Y) = \max_{R_c + R_x = H(X),\ R_c + R_y = H(Y)} R_c$,

$I(X;Y) = \max_{2R_c + R_x + R_y = H(X) + H(Y)} R_c$,

$C_W(X;Y) = \min_{R_c + R_x + R_y = H(X,Y)} R_c$.

Remark 3. The different notions of CI are related as
$C_{GK}(X;Y) \le I(X;Y) \le C_W(X;Y)$, with equality iff $p_{XY}$
is saturable, whence $C_{GK}(X;Y) = I(X;Y) \iff I(X;Y) = C_W(X;Y)$
(see Lemma A2 in Appendix A).

Remark 4. $C_{GK}(X_1; \ldots; X_K)$ is monotonically nonincreasing
in the number of input arguments $K$. In contrast,
$C_W(X_1; \ldots; X_K)$ is monotonically nondecreasing in $K$. It is
easy to show that $C_{GK}(X_1; \ldots; X_K) \le \min_{i \neq j} I(X_i; X_j)$, while
$C_W(X_1; \ldots; X_K) \ge \max_{i \neq j} I(X_i; X_j)$ for any $i, j \in \{1, \ldots, K\}$
(see Lemma A3 in Appendix A).

Witsenhausen [?] defined a symmetric notion of private
information. Witsenhausen’s total private information, denoted
$M_W(X;Y)$, is defined as the complement of Wyner’s CI,

$M_W(X;Y) := H(XY) - C_W(X;Y) = \max_{Q:\ X-Q-Y} H(XY|Q)$.

One can define the private information of $Y$ with respect
to $X$ (denoted $P_W(Y \setminus X)$) as

$P_W(Y \setminus X) := \max_{Q:\ X-Q-Y,\ X-Y-Q} H(Y|Q)$. (6)

Likewise, the complement of $P_W(Y \setminus X)$ is defined as

$C_W(Y \setminus X) := \min_{Q:\ X-Q-Y,\ X-Y-Q} H(Q)$. (7)

The double Markov constraint (see Lemma 2) already hints
at the structure of the minimizer $Q$ in (7). The following
lemma (see proof in Appendix A) shows that the minimizer in
$C_W(Y \setminus X)$ is a minimal sufficient statistic of $Y$ with respect
to $X$.


Lemma 3. Let $Q_Y^X$ denote a function $f$ from $\mathcal{Y}$ to the
probability simplex $\Delta_{\mathcal{X}}$ (the space of all distributions on $\mathcal{X}$)
that defines an equivalence relation on $\mathcal{Y}$:

$y \equiv y'$ iff $p_{X|Y}(x|y) = p_{X|Y}(x|y')$, $\forall x \in \mathcal{X}$, $y, y' \in \mathcal{Y}$.

Then $Q_Y^X$ is a minimal sufficient statistic of $Y$ with respect to
$X$.

Theorem 1 gives a decomposition of $H(Y)$ into a part
that is correlated with $X$ ($H(Q_Y^X)$) and a part that carries no
information about $X$ ($H(Y|Q_Y^X)$) (see proof in Appendix A).

Theorem 1. For any pair of correlated RVs $(X,Y) \sim p_{XY}$,
the following hold:

$C_W(Y \setminus X) = H(Q_Y^X)$, (8a)

$P_W(Y \setminus X) = H(Y|Q_Y^X)$, (8b)

$H(Y) = C_W(Y \setminus X) + P_W(Y \setminus X) = H(Q_Y^X) + H(Y|Q_Y^X)$, (8c)

$C_W(X;Y) \le C_W(Y \setminus X)$. (8d)

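Let $\mathcal{X}_k \subseteq \mathcal{X}$, $\mathcal{Y}_k \subseteq \mathcal{Y}$, where $\mathcal{X}_k$’s and $\mathcal{Y}_k$’s having
different subscripts are distinct (but not necessarily disjoint)
subsets. Let $(\mathcal{X}, \mathcal{Y})$ admit a unique decomposition into components
$\{(\mathcal{X}_k, \mathcal{Y}_k)\}_k$ so that $\bigcup_k \mathcal{X}_k = \mathcal{X}$ and $\{\mathcal{Y}_k\}_k$
is a partition of $\mathcal{Y}$ induced by the equivalence relation in
Lemma 3, i.e., $\forall y, y' \in \mathcal{Y}_k$, $x \in \mathcal{X}_k$, $y \equiv y'$ and
$\forall y \in \mathcal{Y}_k$, $x \notin \mathcal{X}_k$, $p_{Y|X}(y|x) = 0$. We also require that each
component is the “largest” possible in the sense that for any
two components $(\mathcal{X}_i, \mathcal{Y}_i)$, $(\mathcal{X}_j, \mathcal{Y}_j)$, there exists $x' \in \mathcal{X}_i \cup \mathcal{X}_j$
such that $p_{X|Y}(x'|y_i) \neq p_{X|Y}(x'|y_j)$. The size of the component
$(\mathcal{X}_k, \mathcal{Y}_k)$ is defined as $|\mathcal{Y}_k|$. Given such a unique
decomposition of $(\mathcal{X}, \mathcal{Y})$ into components $\{(\mathcal{X}_k, \mathcal{Y}_k)\}_k$, the
following theorem gives necessary and sufficient conditions for
$P_W(Y \setminus X)$ achieving its minimum and maximum value (see
proof in Appendix A).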

Theorem 2. $P_W(Y \setminus X)$ achieves its minimum, $P_W(Y \setminus X) = 0$,
iff there exists no component with size greater than one.

On the other hand, $P_W(Y \setminus X)$ achieves its maximum,
$P_W(Y \setminus X) = H(Y|X)$, iff $p_{XY}$ is saturable iff each component
$(\mathcal{X}_k, \mathcal{Y}_k)$ is a connected component induced by the ergodic
decomposition of $p_{XY}$.

Example 4. P\y{Y\X) attains the lower bound for the fol¬ 
lowing distribution Pxy- ^ ~ {1)2,3,4},3^ = {5,6,7}. We 
write Pxyi.^,^) = (ab). Given, (15) = ^,(17) = |,(25) = 
;^,(27) = ;^,(35) = ^,(37) = i^,(46) = or graphically, 
ho 3 5 .\ 

PXY = ^ • • • f]- Let /(y) = Px\Y=y Then we 

\4 7 2 .J 

have /(5) = [§,1,^,0], /(6) = [0,0,0,1], and f{7) = 
[±^,^,0], so that HiQ§) = H{f{Y)) = = 

H{Y) = 1.15. Consequently Pvv{Y\X) = 0. One can also 
easily verify that H(Q^) = < H{X). 

⁸See Problem 16.28-16.30, p. 394 in [24].

The quantity $C_W(Y \setminus X)$ first appeared in [18], where it
was called the dependent part of $Y$ from $X$. Intuitively,
$C_W(Y \setminus X)$ is the rate of the information contained in $Y$ about
$X$. $C_W(Y \setminus X)$ also appears in [39], [40] and has the following
coding-theoretic interpretation in a source network with coded
side information setup⁹, where $X$ and $Y$ are encoded independently
(at rates $R_X$ and $R_Y$, resp.) and a (joint) decoder
needs to recover $X$ (with small error probability) using the
rate-limited side information $Y$: $C_W(Y \setminus X)$ is the minimum
rate $R_Y$ such that $R_X = H(X|Y)$ is achievable [?]. The
following example shows that even though $H(Y)$ admits a
decomposition¹⁰ of the form in (8c), it might not always be
possible to isolate its parts [39].

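Example 5. Let $\mathcal{X} = \{1, 2\}$ and $\mathcal{Y} = \{3, 4, 5, 6\}$. Consider
the perturbed uniform distribution $p_{XY}$ with
$(13) = (14) = (15) = (16) = \frac{1}{8}$, $(23) = \frac{1}{8} - \delta$, $(24) = \frac{1}{8} + \delta$,
$(25) = \frac{1}{8} + \delta'$, $(26) = \frac{1}{8} - \delta'$, where $\delta, \delta' < \frac{1}{8}$.
If $\delta = \delta' \neq 0$, then $y = 3$ and $y = 6$ induce the same conditional pmf on $\mathcal{X}$,
as do $y = 4$ and $y = 5$, so that
$H(Q_Y^X) = H(\frac{1}{2} - 2\delta, \frac{1}{2} + 2\delta) < H(Y)$.
However, if $\delta \neq \delta'$, then $H(Q_Y^X) = H(Y)$.
In fact, if $\delta \neq \delta'$, as $\delta, \delta' \to 0$, $H(Q_Y^X) = H(Y) \approx 2$, while
$I(X;Y) \to 0$. Thus, even when $I(X;Y) \ll H(Y)$, one needs
to transmit the entire $Y$ (i.e., $R_Y \ge H(Y)$) to convey the full
information contained in $Y$ about $X$.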
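The equivalence relation of Lemma 3 and the decomposition (8c) can be evaluated numerically for Example 5. The sketch below uses exact rational arithmetic so that equal conditional pmfs compare equal; the function names and the specific perturbation values ($\delta = \delta' = 1/16$, and $\delta = 1/16$, $\delta' = 1/32$) are assumptions made here for illustration.

```python
from fractions import Fraction
from math import log2

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(float(p) * log2(float(p)) for p in probs if p > 0)

def sufficient_statistic_classes(p_xy, xs, ys):
    """Group the letters y by their conditional pmf p(x|y), as in Lemma 3.
    Returns a list of classes, each a list of (y, p(y)) pairs."""
    classes = {}
    for y in ys:
        p_y = sum(p_xy.get((x, y), 0) for x in xs)
        if p_y == 0:
            continue
        key = tuple(p_xy.get((x, y), 0) / p_y for x in xs)
        classes.setdefault(key, []).append((y, p_y))
    return list(classes.values())

def dependent_and_private_parts(p_xy, xs, ys):
    """Return (H(Q), H(Y|Q)) for the statistic Q of Lemma 3, in bits."""
    classes = sufficient_statistic_classes(p_xy, xs, ys)
    h_y = entropy(p for cls in classes for _, p in cls)
    h_q = entropy(sum(p for _, p in cls) for cls in classes)
    return h_q, h_y - h_q

def example5(delta, delta_prime):
    """Perturbed uniform pmf of Example 5 with exact rational entries."""
    e = Fraction(1, 8)
    return {(1, 3): e, (1, 4): e, (1, 5): e, (1, 6): e,
            (2, 3): e - delta, (2, 4): e + delta,
            (2, 5): e + delta_prime, (2, 6): e - delta_prime}

xs, ys = [1, 2], [3, 4, 5, 6]
c_eq, p_eq = dependent_and_private_parts(
    example5(Fraction(1, 16), Fraction(1, 16)), xs, ys)
c_ne, p_ne = dependent_and_private_parts(
    example5(Fraction(1, 16), Fraction(1, 32)), xs, ys)
```

With equal perturbations one bit of $Y$ is private; with unequal perturbations every letter of $Y$ forms its own class, so the private part vanishes and the dependent part equals $H(Y)$.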

C. Related Common Information Measures 

We now briefly review some related candidate bivariate cor¬ 
relation measures. We highlight a duality in the optimizations 
in computing the various Cl quantities. 

Starting with Witsenhausen [33], the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation [34] has been used to obtain many impossibility results for the noninteractive simulation of joint distributions [35]. The maximal correlation, denoted $\mathrm{hgr}(X;Y)$, is a function of $p_{XY}(x,y)$ and is defined as

\[ \mathrm{hgr}(X;Y) = \sup_{f_1, f_2} E[f_1(X) f_2(Y)], \]

where $E[\cdot]$ is the expectation operator and the supremum is taken over all real-valued RVs $f_1(X)$ and $f_2(Y)$ such that $E[f_1(X)] = E[f_2(Y)] = 0$ and $E[f_1^2(X)] = E[f_2^2(Y)] = 1$. $\mathrm{hgr}(X;Y)$ has the following geometric interpretation [?]: if $L^2(X,Y)$ is a real separable Hilbert space, then $\mathrm{hgr}(X;Y)$ measures the cosine of the angle between the subspaces $L^2(X) = \{f_1(X) : E[f_1] = 0, E[f_1^2] < \infty\}$ and $L^2(Y) = \{f_2(Y) : E[f_2] = 0, E[f_2^2] < \infty\}$. $\mathrm{hgr}(X;Y)$ shares a number of interesting properties with $I(X;Y)$, viz., (a) nonnegativity: $0 \le \mathrm{hgr}(X;Y) \le 1$ with $\mathrm{hgr}(X;Y) = 0$ iff $X \perp Y$, and $\mathrm{hgr}(X;Y) = 1$ iff $C_{GK}(X;Y) > 0$, i.e., iff $p_{XY}(x,y)$ is decomposable [?], and (b) data processing: $X' - X - Y - Y' \implies \mathrm{hgr}(X';Y') \le \mathrm{hgr}(X;Y)$.
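
For finite alphabets, the supremum above can be evaluated with linear algebra. The sketch below relies on a standard characterization that is not stated in the text: the maximal correlation equals the second-largest singular value of the matrix with entries $p_{XY}(x,y)/\sqrt{p_X(x)p_Y(y)}$. The function name is hypothetical, the singular value is found by plain power iteration after removing the trivial top singular pair, and the fixed start vector is assumed not to be orthogonal to the dominant direction.

```python
from math import sqrt


def maximal_correlation(p_xy, iterations=500):
    """HGR maximal correlation of a finite joint pmf {(x, y): prob}.

    Assumes every listed x and y has positive marginal probability.
    """
    xs = sorted({x for x, _ in p_xy})
    ys = sorted({y for _, y in p_xy})
    p_x = {x: sum(p_xy.get((x, y), 0.0) for y in ys) for x in xs}
    p_y = {y: sum(p_xy.get((x, y), 0.0) for x in xs) for y in ys}
    # B = D_x^{-1/2} P D_y^{-1/2} minus its top singular pair
    # (sqrt(p_x), sqrt(p_y)), whose singular value is always 1.
    b = [[p_xy.get((x, y), 0.0) / sqrt(p_x[x] * p_y[y]) - sqrt(p_x[x] * p_y[y])
          for y in ys] for x in xs]
    v = [float(j + 1) for j in range(len(ys))]  # arbitrary start vector
    norm = sqrt(sum(t * t for t in v))
    v = [t / norm for t in v]
    lam = 0.0
    for _ in range(iterations):
        u = [sum(row[j] * v[j] for j in range(len(ys))) for row in b]
        w = [sum(b[i][j] * u[i] for i in range(len(xs))) for j in range(len(ys))]
        lam = sqrt(sum(t * t for t in w))  # estimates the top eigenvalue of B^T B
        if lam < 1e-15:
            return 0.0  # B vanishes: X and Y are independent
        v = [t / lam for t in w]
    return sqrt(lam)


# Doubly symmetric binary source with crossover probability 0.1.
dsbs = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
independent = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
identical = {(0, 0): 0.5, (1, 1): 0.5}
print(maximal_correlation(dsbs))
print(maximal_correlation(independent))
print(maximal_correlation(identical))
```

The three test distributions illustrate property (a): an intermediate value for correlated bits, 0 for independent bits, and 1 for a decomposable distribution.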

⁹ See Theorem 16.4, p. 361 and Problem 16.26, p. 393 in [24].

¹⁰ From an information structure aspect, recall that owing to the nonmodularity of the information lattice, even a unique decomposition into private and common information structures is not guaranteed (see Remark 1).

Intuitively, for indecomposable distributions, if $\mathrm{hgr}(X;Y)$ is near 1, then $(X,Y)$ still have lots in common. Consider again the GK setup with node $\mathcal{X}$ observing $X^n$, node $\mathcal{Y}$ observing $Y^n$, where $(X^n,Y^n)$ is generated by a 2-DMS $(\mathcal{X} \times \mathcal{Y}, p_{XY})$. Now, a (one-way) rate-limited channel is made available from node $\mathcal{Y}$ to $\mathcal{X}$. Then per [?], the maximum rate of CR extraction at rate $R$ (denoted $C(R)$) is

\[ C(R) = \max_{p_{Q|Y}:\ I(Q;Y) - I(Q;X) \le R} I(Q;Y). \]

We have $C_{GK}(X;Y) = C(0)$ by definition. Hence, if $R = 0$, for indecomposable sources, not even a single bit in common can be extracted [33]. But if $R > 0$, the first few bits of communication can "unlock" the common core of the 2-DMS. Assuming $C(0) = 0$, the initial efficiency of CR extraction is given by [?]

\[ C'(0) = \lim_{R \downarrow 0} \frac{C(R)}{R} = \frac{1}{1 - (s^*(X;Y))^2}, \]


where s*(7f;y) = max unvl ■ 

PQiv- HQX)>oR^X) 

Alternatively, given a 2-DMS $(\mathcal{X} \times \mathcal{Y}, p_{XY})$, one can define the maximum amount of information that a rate $R$ description of source $\mathcal{Y}$ conveys about source $\mathcal{X}$, denoted $T(R)$, that admits the following single-letter characterization [?].

\[ T(R) = \max_{p_{Q|Y}:\ I(Q;Y) \le R} I(Q;X), \tag{9} \]

where it suffices to restrict ourselves to $p_{Q|Y}$ with alphabet $\mathcal{Q}$ such that $|\mathcal{Q}| \le |\mathcal{Y}| + 1$. The initial efficiency of information extraction from source $\mathcal{Y}$ is given by

\[ T'(0) = \lim_{R \downarrow 0} \frac{T(R)}{R}. \]

We have $s^*(X;Y) = 1$ iff $C_{GK}(X;Y) > 0$ [?].

Interestingly, a dual of the optimization in (9) gives the well-known information bottleneck (IB) optimization [38] that provides a tractable algorithm for approximating the minimal sufficient statistic of $Y$ with respect to $X$ ($Q_Y$ in Lemma 3). For some constant $\epsilon$, the IB solves the nonconvex optimization problem

\[ \min_{p_{Q|Y}:\ I(Q;X) \ge \epsilon} I(Q;Y) \tag{10} \]

by alternating iterations amongst a set of convex distributions [?].

Since $C_W(X;Y)$ is neither concave nor convex in $Q$, computation of $C_W(X;Y)$ remains a difficult extremization problem in general, and simple solutions exist only for some special distributions [?].


D. New Measures 

A symmetric measure of CI that combines features of both the GK and Wyner measures can be defined by a RV $Q$ as follows.

\[ C^1(X;Y) = \min_{p_{Q|XY}} I(Y;Q|X) + I(X;Q|Y) + I(X;Y|Q), \tag{11} \]

where it suffices to minimize over all $Q$ such that $|\mathcal{Q}| \le |\mathcal{X}||\mathcal{Y}| + 2$. Observe that $C^1(X;Y) = 0$ if $p_{XY}$ is saturable. $C^1(X;Y)$ thus quantifies the minimum distance to saturability. However, $C^1(X;Y)$ is much harder to compute than the GK CI.

More useful for our immediate purposes is the following asymmetric notion of CI for 3 RVs $(X_1,X_2,Y)$ [6].

\[ C^2(\{X_1,X_2\};Y) = \max_{Q:\ Q - X_1 - Y,\ Q - X_2 - Y} I(Q;Y) \tag{12} \]

It is easy to see that $C^2$ retains an important monotonicity property of the original definition of GK (see Remark 4) in that $C^2$ is monotonically nonincreasing in the number of input $X_i$'s, i.e., $C^2(\{X_1,\ldots,X_K\};Y) \le C^2(\{X_1,\ldots,X_{K-1}\};Y)$.

One can also define the following generalization of the Wyner common entropy in (5).

\[ C^3(\{X_1,X_2\};Y) = \min_{Q:\ X_1 - Q - Y,\ X_2 - Q - Y} H(Q) \tag{13} \]

It is easy to see that $C^3(\{X_1,X_2\};Y) \ge C^3(\{X_1\};Y) = G(X_1;Y) \ge C_W(X_1;Y) \ge I(X_1;Y)$. $C^3$ is monotonically nondecreasing in the number of input $X_i$'s.

Any reasonable CI-based measure of redundancy in the PI decomposition framework must be nonincreasing in the number of predictors. In the next section, we exclusively concentrate on $C^2$. Better understanding of $C^2$ will guide our investigation in Section III in search of an ideal measure of redundancy for PI decomposition.

III. Partial Information Decomposition: The Case for One Target and Two Predictor Variables

Consider the following generalization of Shannon's MI for three RVs $(X_1,X_2,Y)$, called co-information [43] or interaction information (with a change of sign) [?].

\[ I_{Co}(X_1;X_2;Y) = I(X_1;X_2) - I(X_1;X_2|Y) \tag{14} \]

Co-information is symmetric with respect to permutations of its input arguments and can be interpreted as the gain or loss in correlation between two RVs, when an additional RV is considered. The symmetry is evident from noting that $I(X_1;X_2) - I(X_1;X_2|Y) = I(X_1;Y) - I(X_1;Y|X_2) = I(X_2;Y) - I(X_2;Y|X_1)$. Given a ground set $\Omega$ of RVs, the Shannon entropies form a Boolean lattice consisting of all subsets of $\Omega$, ordered according to set inclusions [42]. Co-informations and entropies are Möbius transform pairs with the co-informations also forming a lattice [?]. Co-information can however be negative when there is pairwise independence, as is exemplified by a simple two-input XOR function, $Y = \mathrm{XOR}(X_1,X_2)$. Bringing in additional side information $Y$ induces artificial correlation between $X_1$ and $X_2$ when there was none to start with. Intuitively, these artificial correlations are the source of synergy. Indeed, co-information is widely used as a synergy-redundancy measure with positive values implying redundancy and negative values expressing synergy [?]. However, as the following example shows, co-information confounds synergy and redundancy and is identically zero if the interactions induce synergy and redundancy in equal measure.

Example 6. Let $\mathcal{X}_1 = \mathcal{X}_2 = \mathcal{Y} = \{1,2,3,4\}$. We write $p_{X_1X_2Y}(a,b,c) := (abc)$. Consider the following distribution: $(111) = (122) = (212) = (221) = (333) = (344) = (434) = (443) = \frac{1}{8}$. First note that $I(X_1X_2;Y) = 2$ bits. The construction of $p_{X_1X_2Y}$ is such that one bit of information about $Y$ is contained identically in both $X_1$ and $X_2$. The other bit of information about $Y$ is contained only in the joint RV $X_1X_2$. Thus, $X_1, X_2$ contain equal amounts of synergistic and redundant information about $Y$. However, it is easy to check that $I_{Co}(Y;X_1;X_2) = I(Y;X_1) - I(Y;X_1|X_2) = 0$.
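
The two co-information values discussed above (negative for XOR, zero for Example 6) can be reproduced directly from (14). This is a small illustrative sketch; the helper names are hypothetical and outcomes are assumed to be ordered as $(x_1, x_2, y)$.

```python
from collections import defaultdict
from math import log2


def entropy(pmf, idx):
    """Entropy (bits) of the coordinates `idx` of a pmf {outcome tuple: prob}."""
    marg = defaultdict(float)
    for outcome, prob in pmf.items():
        marg[tuple(outcome[i] for i in idx)] += prob
    return -sum(q * log2(q) for q in marg.values() if q > 0)


def cond_mi(pmf, a, b, c=()):
    """I(A;B|C) = H(AC) + H(BC) - H(ABC) - H(C)."""
    return (entropy(pmf, a + c) + entropy(pmf, b + c)
            - entropy(pmf, a + b + c) - entropy(pmf, c))


def co_information(pmf):
    """Eq. (14) for outcomes ordered as (x1, x2, y)."""
    return cond_mi(pmf, (0,), (1,)) - cond_mi(pmf, (0,), (1,), (2,))


xor = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
example6 = {t: 0.125 for t in [(1, 1, 1), (1, 2, 2), (2, 1, 2), (2, 2, 1),
                               (3, 3, 3), (3, 4, 4), (4, 3, 4), (4, 4, 3)]}
print(co_information(xor))       # pure synergy: minus one bit
print(co_information(example6))  # one redundant and one synergistic bit cancel
```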

It is also less clear if the co-information retains its intuitive appeal for higher-order interactions (> 2 predictor variables), when the same state of a target RV $Y$ can have any combination of redundant, unique and (or) synergistic effects [43].

The partial information (PI) decomposition framework (due to Williams and Beer [2]) offers a solution to disentangle the redundant, unique and synergistic contributions to the total mutual information that a set of $K$ predictor RVs convey about a target RV. Consider the $K = 2$ case. We use the following notation: $UI(\{X_1\};Y)$ and $UI(\{X_2\};Y)$ denote, respectively, the unique information about $Y$ that $X_1$ and $X_2$ exclusively convey; $I_\cap(\{X_1,X_2\};Y)$ is the redundant information about $Y$ that $X_1$ and $X_2$ both convey; $SI(\{X_1X_2\};Y)$ is the synergistic information about $Y$ that is conveyed only by the joint RV $(X_1,X_2)$.

The governing equations for the PI decomposition are given in (15) [?].

\begin{align}
I(X_1X_2;Y) &= \underbrace{I_\cap(\{X_1,X_2\};Y)}_{\text{redundant}} + \underbrace{SI(\{X_1X_2\};Y)}_{\text{synergistic}} + \underbrace{UI(\{X_1\};Y) + UI(\{X_2\};Y)}_{\text{unique}} \tag{15a} \\
I(X_1;Y) &= I_\cap(\{X_1,X_2\};Y) + UI(\{X_1\};Y) \tag{15b} \\
I(X_2;Y) &= I_\cap(\{X_1,X_2\};Y) + UI(\{X_2\};Y) \tag{15c}
\end{align}

Using the chain rule of MI, (15a)-(15c) implies

\begin{align}
I(X_1;Y|X_2) &= SI(\{X_1X_2\};Y) + UI(\{X_1\};Y) \tag{15d} \\
I(X_2;Y|X_1) &= SI(\{X_1X_2\};Y) + UI(\{X_2\};Y) \tag{15e} \\
I(Y;X_1) + UI(\{X_2\};Y) &= I(Y;X_2) + UI(\{X_1\};Y) \tag{15f}
\end{align}

From (15b)-(15e), one can easily see that the co-information is the difference between redundant and synergistic information. In particular, we have the following bounds.

\begin{align}
-\min\{I(X_1;Y|X_2), I(X_2;Y|X_1), I(X_1;X_2|Y)\} &\le I_\cap(\{X_1,X_2\};Y) - SI(\{X_1X_2\};Y) \nonumber \\
&\le \min\{I(X_1;Y), I(X_2;Y), I(X_1;X_2)\} \tag{15g}
\end{align}

Equivalently, $I_\cap(\{X_1,X_2\};Y) \le SI(\{X_1X_2\};Y)$ when there is any pairwise independence, i.e., when $X_1 \perp X_2$, or $X_1 \perp Y$, or $X_2 \perp Y$, and $I_\cap(\{X_1,X_2\};Y) \ge SI(\{X_1X_2\};Y)$ when $(X_1,X_2,Y)$ form a Markov chain in any order, i.e., when $X_1 - Y - X_2$, or $X_1 - X_2 - Y$, or $X_2 - X_1 - Y$. The following lemma gives conditions under which $I_\cap$ achieves its bounds.

Lemma 4.

a) If $X_1 - X_2 - Y$, then $I_\cap(\{X_1,X_2\};Y) = I(X_1;Y)$.

b) If $X_2 - X_1 - Y$, then $I_\cap(\{X_1,X_2\};Y) = I(X_2;Y)$.

c) If $X_1 - X_2 - Y$ and $X_2 - X_1 - Y$, then $I_\cap(\{X_1,X_2\};Y) = I(X_1;Y) = I(X_2;Y) = I(X_1X_2;Y)$.

d) If $X_1 - Y - X_2$, then $I_\cap(\{X_1,X_2\};Y) \ge I(X_1;X_2)$.

Proof: The proofs follow directly from (15b)-(15e) and the symmetry of co-information. ■
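
The step from (15b)-(15e) to the co-information identity and to Lemma 4(a) can be written out explicitly. The only assumption added in this sketch is that $SI$ and $UI$ are nonnegative, as an ideal decomposition would require.

```latex
\begin{align*}
I_{Co}(X_1;X_2;Y) &= I(X_1;Y) - I(X_1;Y|X_2) \\
  &= \bigl[I_\cap(\{X_1,X_2\};Y) + UI(\{X_1\};Y)\bigr]
   - \bigl[SI(\{X_1X_2\};Y) + UI(\{X_1\};Y)\bigr]
   && \text{by (15b) and (15d)} \\
  &= I_\cap(\{X_1,X_2\};Y) - SI(\{X_1X_2\};Y). \\[4pt]
X_1 - X_2 - Y \;&\Longrightarrow\; I(X_1;Y|X_2) = 0
  \;\Longrightarrow\; SI(\{X_1X_2\};Y) = UI(\{X_1\};Y) = 0
   && \text{by (15d) and nonnegativity} \\
  &\Longrightarrow\; I_\cap(\{X_1,X_2\};Y) = I(X_1;Y)
   && \text{by (15b)}.
\end{align*}
```

Parts b) and c) follow in the same way after exchanging the roles of $X_1$ and $X_2$.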

The following easy lemma gives the conditions under which the functions $I_\cap$, $UI$ and $SI$ vanish.

Lemma 5.

a) If $X_1 \perp Y$ or $X_2 \perp Y$, then $I_\cap(\{X_1,X_2\};Y) = 0$. Also, $X_1 \perp X_2 \not\Rightarrow I_\cap(\{X_1,X_2\};Y) = 0$.

b) If $X_1 - X_2 - Y$, then $UI(\{X_1\};Y) = 0$. Further, $SI(\{X_1X_2\};Y) = 0$, $I_\cap(\{X_1,X_2\};Y) = I(X_1;Y)$, and $UI(\{X_2\};Y) = I(X_2;Y|X_1)$.

c) If the predictor variables are identical or if either $X_1 - X_2 - Y$ or $X_2 - X_1 - Y$, then $SI(\{X_1X_2\};Y) = 0$. Also, if $\mathcal{Y} = \mathcal{X}_1 \times \mathcal{X}_2$ and $Y = X_1X_2$, then $SI(\{X_1X_2\};Y) = 0$ and $I_\cap(\{X_1,X_2\};Y) = I(X_1;X_2)$.

Proof: The first part of a) is immediate from (15b) and (15c). The second part of a) is a direct consequence of the asymmetry built in the PI decomposition by distinguishing the predictor RVs $(X_1,X_2)$ from the target RV ($Y$). Indeed, $X_1 \perp X_2$ merely implies that $I_\cap(\{X_1,X_2\};Y) = SI(\{X_1X_2\};Y) - I(X_1;X_2|Y)$; the RHS does not vanish in general. Parts b) and c) follow directly from (15b)-(15e). ■





We visualize the PI decomposition of the total mutual information $I(X_1X_2;Y)$ using a PI-diagram [2]. As detailed below, Fig. 1 shows the PI-diagrams for the "ideal" PI decomposition of several canonical functions, viz., COPY (and its degenerate simplifications UNQ and RDN), XOR and AND [?]. Each irreducible PI atom in a PI-diagram represents information that is either unique, synergistic or redundant. Ideally, one would like to further distinguish the redundancy induced by the function or mechanism itself (called functional or mechanistic redundancy) from that which is already present between the predictors themselves (called predictor redundancy). However, at present it is not clear how these contributions can be disentangled, except for the special case of independent predictor RVs when the entire redundancy can be attributed solely to the mechanism [?].

Example 7. Consider the COPY function, $Y = \mathrm{COPY}(X_1,X_2)$, where $Y$ consists of a perfect copy of $X_1$ and $X_2$, i.e., $Y = X_1X_2$ with $\mathcal{Y} = \mathcal{X}_1 \times \mathcal{X}_2$. The COPY function explicitly induces mechanistic redundancy and we expect that MI between the predictors completely captures this redundancy, i.e., $I_\cap(\{X_1,X_2\};(X_1,X_2)) = I(X_1;X_2)$. Indeed, Lemma 5(c) codifies this intuition.

(a) Fig. 1(a) shows the ideal PI decomposition for the distribution $p_{X_1X_2Y}$ with (00"00") = (01"01") = (11"11") = $\frac{1}{3}$, where (ab"ab") := $p_{X_1X_2Y}(a,b,ab)$. We then have $I_\cap(\{X_1,X_2\};(X_1,X_2)) = I(X_1;X_2) = +.252$, $SI(\{X_1X_2\};Y) = 0$ and $UI(\{X_1\};Y) = UI(\{X_2\};Y) = +.667$.

(b) Fig. 1(b) shows the ideal PI decomposition for a simpler distribution $p_{X_1X_2Y}$ with (00"00") = (01"01") = (10"10") = (11"11") = $\frac{1}{4}$. Now $Y$ consists of a perfect copy of two i.i.d. RVs. Clearly, $I_\cap(\{X_1,X_2\};(X_1,X_2)) = 0$. Since $SI(\{X_1X_2\};Y) = 0$ (vide Lemma 5(c)), only the unique contributions are nonzero, i.e., $UI(\{X_1\};Y) = UI(\{X_2\};Y) = +1$. We call this the UNQ function.

(c) Fig. 1(c) shows the ideal PI decomposition for the distribution $p_{X_1X_2Y}$ with $(000) = (111) = \frac{1}{2}$. This is an instance of a redundant COPY mechanism with $X_1 = X_2 = Z$, where $Z = \mathrm{Bernoulli}(\frac{1}{2})$, so that $Y = X_1 = X_2 = Z$. We then have $I(X_1X_2;Y) = I_\cap(\{X_1,X_2\};(X_1,X_2)) = 1$. We call this the RDN function.
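
The numbers quoted in Example 7 follow from (15a)-(15c) once the redundancy of a COPY target is set to $I(X_1;X_2)$. The sketch below is illustrative only: the helper names are hypothetical, and the choice $I_\cap = I(X_1;X_2)$ is the "ideal" value assumed in the example, not the output of any particular redundancy measure.

```python
from collections import defaultdict
from math import log2


def entropy(pmf, idx):
    """Entropy (bits) of the coordinates `idx` of a pmf {outcome tuple: prob}."""
    marg = defaultdict(float)
    for outcome, prob in pmf.items():
        marg[tuple(outcome[i] for i in idx)] += prob
    return -sum(q * log2(q) for q in marg.values() if q > 0)


def mi(pmf, a, b):
    """I(A;B) = H(A) + H(B) - H(AB)."""
    return entropy(pmf, a) + entropy(pmf, b) - entropy(pmf, a + b)


def ideal_copy_decomposition(p_x1x2):
    """PI atoms for Y = (X1, X2), assuming I_cap = I(X1;X2)."""
    pmf = {(a, b, (a, b)): prob for (a, b), prob in p_x1x2.items()}
    redundancy = mi(pmf, (0,), (1,))
    unique_1 = mi(pmf, (0,), (2,)) - redundancy                         # (15b)
    unique_2 = mi(pmf, (1,), (2,)) - redundancy                         # (15c)
    synergy = mi(pmf, (0, 1), (2,)) - redundancy - unique_1 - unique_2  # (15a)
    return redundancy, unique_1, unique_2, synergy


copy_a = ideal_copy_decomposition({(0, 0): 1 / 3, (0, 1): 1 / 3, (1, 1): 1 / 3})
unq = ideal_copy_decomposition({(a, b): 0.25 for a in (0, 1) for b in (0, 1)})
rdn = ideal_copy_decomposition({(0, 0): 0.5, (1, 1): 0.5})
for name, atoms in (("COPY", copy_a), ("UNQ", unq), ("RDN", rdn)):
    print(name, [round(v, 3) for v in atoms])
```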

Example 8. Fig. 1(d) captures the PI decomposition of the following distribution: $Y = \mathrm{XOR}(X_1,X_2)$, where $X_i = \mathrm{Bernoulli}(\frac{1}{2})$, $i = 1,2$. Only the joint RV $X_1X_2$ specifies information about $Y$, i.e., $I(X_1X_2;Y) = 1$ whereas the singletons specify nothing, i.e., $I(X_i;Y) = 0$, $i = 1,2$. Neither the mechanism nor the predictors induce any redundancy since $I_\cap(\{X_1,X_2\};Y) = 0$. XOR is an instance of a purely synergistic function.

Fig. 1(e) shows the ideal PI decomposition for the following distribution: $Y = (\mathrm{XOR}(X_1',X_2'), (X_1'',X_2''), Z)$, where the predictor inputs are $X_1 = (X_1',X_1'',Z)$ and $X_2 = (X_2',X_2'',Z)$ with $X_1', X_2', X_1'', X_2'', Z$ i.i.d. The total MI of 4 bits is distributed equally between the four PI atoms. We call this the RDNUNQXOR function since it is a composition of the functions RDN, UNQ and XOR. Also see Example 6 which gives an instance of composition of functions RDN and XOR.

Example 9. Fig. 1(f) shows the PI decomposition of the following distribution: $Y = \mathrm{AND}(X_1,X_2)$, where $X_i = \mathrm{Bernoulli}(\frac{1}{2})$, $i = 1,2$ and $p_{X_1X_2Y}$ is such that $(000) = (010) = (100) = (111) = \frac{1}{4}$. The decomposition evinces both synergistic and redundant contributions to the total MI. The synergy can be explained as follows. First note that $X_1 \perp X_2$, but $X_1 \not\perp X_2 \,|\, Y$ since $I(X_1;X_2|Y) = +.189 \neq 0$. Fixing the output $Y$ induces correlations between the predictors $X_1$ and $X_2$ when there was none to start with. The induced correlations are the source of positive synergy.

Fig. 1. PI-diagrams showing the "ideal" PI decomposition of $I(X_1X_2;Y)$ for some canonical examples. {1} and {2} denote, resp., unique information about $Y$ that $X_1$ and $X_2$ exclusively convey; {1,2} is the redundant information about $Y$ that $X_1$ and $X_2$ both convey; {12} is the synergistic information about $Y$ that can only be conveyed by the joint RV $(X_1,X_2)$. (a) COPY (b) UNQ (c) RDN (d) XOR (e) RDNUNQXOR (f) AND (see description in text)

Perhaps more surprisingly, redundant information is not 0 even though $X_1 \perp X_2$. The redundancy can be explained by noting that if either predictor input $X_1 = 0$ or $X_2 = 0$, then both $X_1$ and $X_2$ can exclude the possibility of $Y = 1$. Hence the latter is nontrivial information shared between $X_1$ and $X_2$. This is clearer in light of the following argument that uses information structure aspects. Given the support of $p_{X_1X_2Y}$, the set of possible states of Nature include $\Omega = \{(000),(010),(100),(111)\} := \{\omega_1,\omega_2,\omega_3,\omega_4\}$. $X_1$ generates the information partition $\mathcal{P}_{X_1} = \omega_1\omega_2|\omega_3\omega_4$. Likewise, $X_2$ generates the partition $\mathcal{P}_{X_2} = \omega_1\omega_3|\omega_2\omega_4$. Let the true state of Nature be $\omega_1$. Consider the event $E = \{\omega_1,\omega_2,\omega_3\}$. Both $X_1$ and $X_2$ know $E$ at $\omega_1$, since $\mathcal{P}_{X_1}(\omega_1) = \{\omega_1,\omega_2\} \subset E$ and $\mathcal{P}_{X_2}(\omega_1) = \{\omega_1,\omega_3\} \subset E$. The event that $X_1$ knows $E$ is $K_{X_1}(E) = \{\omega_1,\omega_2\}$. Likewise, the event that $X_2$ knows $E$ is $K_{X_2}(E) = \{\omega_1,\omega_3\}$. Clearly, the event $K_{X_1}(E) \cap K_{X_2}(E) = \{\omega_1\}$ is known to both $X_1$ and $X_2$, so that $Y = 1$ can be ruled out with probability of agreement one.
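
The partition argument above is mechanical enough to script. This is a small sketch with hypothetical names, assuming the usual definition of the knowledge operator, $K_X(E) = \{\omega : \mathcal{P}_X(\omega) \subseteq E\}$.

```python
# States of Nature for AND, written as (x1, x2, y).
STATES = {"w1": (0, 0, 0), "w2": (0, 1, 0), "w3": (1, 0, 0), "w4": (1, 1, 1)}


def cell(observer, state):
    """Partition cell of `state` for the observer at coordinate index `observer`."""
    value = STATES[state][observer]
    return {s for s, outcome in STATES.items() if outcome[observer] == value}


def knows(observer, event):
    """K(E): the states at which the observer's cell is contained in the event E."""
    return {s for s in STATES if cell(observer, s) <= event}


E = {"w1", "w2", "w3"}                 # the event "Y = 0"
k1, k2 = knows(0, E), knows(1, E)
print(sorted(k1), sorted(k2), sorted(k1 & k2))
```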

Fig. 2. PI-diagram for the decomposition of Massey's directed information (DI). The colored areas correspond to the local DI term $I(X^N \to Y^N)(i)$, where {1} = $UI(\{X^i\};Y_i)$, {12} = $SI(\{X^iY^{i-1}\};Y_i)$, {1,2} = $I_\cap(\{X^i,Y^{i-1}\};Y_i)$, and {2} = $UI(\{Y^{i-1}\};Y_i)$ (see text)

Indeed, for independent $X_1$ and $X_2$, when one can attribute the redundancy entirely to the mechanism, there is some consensus that $I_\cap(\{X_1,X_2\};Y) = \frac{3}{4}\log\frac{4}{3} = +.311$ and $SI(\{X_1X_2\};Y) = +.5$ [?].
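
The values quoted for AND are consistent with (15a)-(15c): once a redundancy value is fixed, the remaining atoms are determined. The sketch below uses hypothetical helper names and takes the consensus redundancy $\frac{3}{4}\log_2\frac{4}{3}$ as an input rather than deriving it.

```python
from collections import defaultdict
from math import log2


def entropy(pmf, idx):
    """Entropy (bits) of the coordinates `idx` of a pmf {outcome tuple: prob}."""
    marg = defaultdict(float)
    for outcome, prob in pmf.items():
        marg[tuple(outcome[i] for i in idx)] += prob
    return -sum(q * log2(q) for q in marg.values() if q > 0)


def cond_mi(pmf, a, b, c=()):
    """I(A;B|C) = H(AC) + H(BC) - H(ABC) - H(C)."""
    return (entropy(pmf, a + c) + entropy(pmf, b + c)
            - entropy(pmf, a + b + c) - entropy(pmf, c))


def pid_from_redundancy(pmf, redundancy):
    """Solve (15a)-(15c) for (UI_1, UI_2, SI) given a redundancy value."""
    i1 = cond_mi(pmf, (0,), (2,))
    i2 = cond_mi(pmf, (1,), (2,))
    i12 = cond_mi(pmf, (0, 1), (2,))
    return i1 - redundancy, i2 - redundancy, i12 - i1 - i2 + redundancy


and_pmf = {(a, b, a & b): 0.25 for a in (0, 1) for b in (0, 1)}
redundancy = 0.75 * log2(4 / 3)                 # consensus value quoted in the text
unique_1, unique_2, synergy = pid_from_redundancy(and_pmf, redundancy)
induced = cond_mi(and_pmf, (0,), (1,), (2,))    # I(X1;X2|Y)
print(round(redundancy, 3), round(unique_1, 3), round(synergy, 3), round(induced, 3))
```

With this input the unique atoms vanish, so the total MI of AND splits into redundancy and synergy only.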

Remark 5. Independence of the predictor RVs implies a 
vanishing predictor redundancy but not necessarily a vanishing 
mechanistic redundancy (also see second part of Lemma 5(a)). 

As one final illustrative application of this framework, we consider the decomposition of Massey's directed information (DI) [?] into PI atoms.

Example 10. For discrete-time stochastic processes $X^N$ and $Y^N$, the DI from $X$ to $Y$ is defined as follows.

\[ I(X^N \to Y^N) := \sum_{i=1}^{N} I(X^i;Y_i|Y^{i-1}), \]

where $X^i := \{X_i, X_{i-1}, \ldots\}$ denotes the past of $X$ relative to time $i$. $I(X^N \to Y^N)$ answers the following operational question: Does consideration of the past of the process $X^N$ help in predicting the process $Y^N$ better than when considering the past of $Y^N$ alone? DI is a sum of conditional mutual information terms and admits an easy PI decomposition.

\[ I(X^i;Y_i|Y^{i-1}) = UI(\{X^i\};Y_i) + SI(\{X^iY^{i-1}\};Y_i), \tag{16} \]

where we have used (15d) with $X_1 := X^i$, $X_2 := Y^{i-1}$ and $Y := Y_i$.

The decomposition has an intuitive appeal. Conditioning on the past $Y^{i-1}$ gets rid of the common histories or redundancies shared between $X^i$ and $Y^{i-1}$ and adds in their synergy. Thus, given the knowledge of the past $Y^{i-1}$, information gained from learning $X^i$ has a unique component from $X^i$ alone as well as a synergistic component that comes from the interaction of $X^i$ and $Y^{i-1}$. The colored areas in Fig. 2 show this decomposition of the "local" DI term $I(X^N \to Y^N)(i)$ into PI atoms, where $I(X^N \to Y^N) = \sum_{i=1}^{N} I(X^N \to Y^N)(i)$.

From (15a)-(15c), it is easy to see that the three equations specifying $I(X_1X_2;Y)$, $I(X_1;Y)$ and $I(X_2;Y)$ do not fully determine the four functions $I_\cap(\{X_1,X_2\};Y)$, $UI(\{X_1\};Y)$, $UI(\{X_2\};Y)$ and $SI(\{X_1X_2\};Y)$. To specify a unique decomposition, one of the functions $I_\cap$, $SI$ or $UI$ needs to be defined, or a fourth equation relating $I_\cap$, $SI$, and $UI$ needs to be supplied.

PI decomposition researchers have focused on axiomatically deriving measures of redundant [?], synergistic [?] and unique information [?]. For instance, for a general $K$, any valid measure of redundancy $I_\cap(X_1,\ldots,X_K;Y)$ must satisfy the following basic properties. Let $R_1,\ldots,R_k \subseteq \{X_1,\ldots,X_K\}$, where $k \le K$.

(GP) Global Positivity: $I_\cap(\{R_1,\ldots,R_k\};Y) \ge 0$.

(S) Symmetry: $I_\cap(\{R_1,\ldots,R_k\};Y)$ is invariant under reordering of the $X_i$'s.

(I) Self-redundancy: $I_\cap(R;Y) = I(R;Y)$. For instance, for a single predictor $X_1$, the redundant information about the target $Y$ must equal $I(X_1;Y)$.

(M) Weak Monotonicity: $I_\cap(\{R_1,\ldots,R_{k-1},R_k\};Y) \le I_\cap(\{R_1,\ldots,R_{k-1}\};Y)$ with equality if $\exists R_i \in \{R_1,\ldots,R_{k-1}\}$ such that $H(R_iR_k) = H(R_k)$.

(SM) Strong Monotonicity: $I_\cap(\{R_1,\ldots,R_{k-1},R_k\};Y) \le I_\cap(\{R_1,\ldots,R_{k-1}\};Y)$ with equality if $\exists R_i \in \{R_1,\ldots,R_{k-1}\}$ such that $I(R_iR_k;Y) = I(R_k;Y)$. For the equality condition for $K = 2$, also see Lemma 4(a)-(c).

(LP) Local Positivity: For all $K$, the derived PI measures are nonnegative. For instance, for $K = 2$, a nonnegative PI measure for synergy requires that $I(X_1X_2;Y) \ge I_\cup(\{X_1,X_2\};Y)$, where $I_\cup$ is the union information which is related to $I_\cap$ (for any $K$) by the inclusion-exclusion principle [2].

(Id) Identity: For $K = 2$, $I_\cap(\{X_1,X_2\};(X_1,X_2)) = I(X_1;X_2)$ [?].

The following properties capture the behavior of an ideal $I_\cap$ when one of the predictor or target arguments is enlarged.

(TM) Target Monotonicity: If $H(Y|Z) = 0$, then $I_\cap(\{R_1,\ldots,R_k\};Y) \le I_\cap(\{R_1,\ldots,R_k\};Z)$.

(PM) Predictor Monotonicity: If $H(R_1|R_1') = 0$, then $I_\cap(\{R_1,\ldots,R_k\};Y) \le I_\cap(\{R_1',R_2,\ldots,R_k\};Y)$.

A similar set of monotonicity properties are desirable of an ideal $UI$. We consider only the $K = 2$ case and write $UI_{X_2}(\{X_1\};Y)$ to explicitly specify the information about $Y$ exclusively conveyed by $X_1$.

(TM$_u$) Target Monotonicity: If $H(Y|Z) = 0$, then $UI_{X_2}(\{X_1\};Y) \le UI_{X_2}(\{X_1\};Z)$.

(PM$_u$) Predictor Monotonicity: If $H(X_1|X_1') = 0$, then $UI_{X_2}(\{X_1\};Y) \le UI_{X_2}(\{X_1'\};Y)$.

(PM$_u^c$) Predictor Monotonicity with respect to the complement: If $H(X_2|X_2') = 0$, then $UI_{X_2'}(\{X_1\};Y) \le UI_{X_2}(\{X_1\};Y)$.

Properties (M) and (SM) ensure that any reasonable measure of redundancy is monotonically nonincreasing with the number of predictors. For a general $K$, given a measure of redundant information that satisfies (S) and (M), only those subsets need to be considered which satisfy the ordering relation $R_i \not\subseteq R_j, \forall i \neq j$ (i.e., the family of sets $R_1,\ldots,R_k$ forms an antichain) [2], [?]. Define a partial order $\preceq$ on the set of antichains by the relation: $(S_1,\ldots,S_m) \preceq (R_1,\ldots,R_k)$ iff for each $j = 1,\ldots,k$ $\exists i \le m$ such that $S_i \subseteq R_j$. Then, equipped with $\preceq$, the set of antichains form a lattice $\mathcal{L}$ called the PI or the redundancy lattice [?]. By virtue of (M), for a fixed $Y$, $I_\cap(\{R_1,\ldots,R_k\};Y)$ is a monotone function with respect to $\preceq$. Then, a unique decomposition of the total mutual information is accomplished by associating with each element of $\mathcal{L}$ a PI measure $I_\partial$ which is the Möbius transform of $I_\cap$ so that we have $I_\cap(\{R_1,\ldots,R_k\};Y) = \sum_{(S_1,\ldots,S_m) \preceq (R_1,\ldots,R_k)} I_\partial(\{S_1,\ldots,S_m\};Y)$.

For instance, for $K = 2$ (see (15a)), the PI measures are $I_\partial(\{X_1,X_2\};Y) = I_\cap(\{X_1,X_2\};Y)$, $I_\partial(\{X_1\};Y) = UI(\{X_1\};Y)$, $I_\partial(\{X_2\};Y) = UI(\{X_2\};Y)$, and $I_\partial(\{X_1X_2\};Y) = SI(\{X_1X_2\};Y)$.
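
The Möbius inversion on the $K = 2$ redundancy lattice can be illustrated with a short sketch. The names below are hypothetical, antichains are encoded as frozensets of frozensets of predictor indices, and the input values are the consensus AND numbers from Example 9 combined with self-redundancy (I).

```python
from math import log2


def precedes(alpha, beta):
    """alpha <= beta iff every set in beta contains some set of alpha."""
    return all(any(a <= b for a in alpha) for b in beta)


def pi_atoms(i_cap):
    """Moebius inversion: atom(node) = i_cap(node) - sum of atoms strictly below."""
    atoms = {}
    # A node with fewer predecessors is lower in the lattice, so visit it first.
    order = sorted(i_cap, key=lambda node: sum(precedes(o, node) for o in i_cap))
    for node in order:
        below = sum(atoms[o] for o in atoms if o != node and precedes(o, node))
        atoms[node] = i_cap[node] - below
    return atoms


X1, X2, X12 = frozenset({1}), frozenset({2}), frozenset({1, 2})
RDN, UNQ1, UNQ2, SYN = (frozenset({X1, X2}), frozenset({X1}),
                        frozenset({X2}), frozenset({X12}))

r = 0.75 * log2(4 / 3)           # I(X1;Y) = I(X2;Y) for AND, also the assumed redundancy
i_cap_and = {RDN: r, UNQ1: r, UNQ2: r, SYN: 0.5 + r}   # 0.5 + r = I(X1X2;Y) = H(Y)
atoms = pi_atoms(i_cap_and)
print([round(atoms[n], 3) for n in (RDN, UNQ1, UNQ2, SYN)])
```

Because the recursion only needs the order relation, the same two functions apply unchanged to larger lattices, provided $I_\cap$ is supplied for every antichain.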

While elegant in its formulation, the lattice construction does not by itself guarantee a nonnegative decomposition of the total mutual information. The latter depends on the chosen measure of redundancy used to generate the PI decomposition. Given a measure of redundant information, some of the recurrent pathologies reported thus far include incompatibility of properties (a) (LP) and (TM) [?], (b) (LP) and (Id) for $K \ge 3$ [?], and (c) (TM) and (Id) for $K = 2$, whenever there is mechanistic dependence between the target and the predictors [?]. For a nonvanishing mechanistic dependency, (TM) and (Id) are incompatible since together they imply $I_\cap(\{X_1,X_2\};Y) \le I_\cap(\{X_1,X_2\};(X_1,X_2)) = I(X_1;X_2)$. For example, the desired decomposition of AND in Example 9 contradicts (TM). None of the measures of $I_\cap$ proposed thus far satisfies (TM). In the next section, we restrict ourselves to the bivariate case as some of the pathological features are already manifest.


A. Measures of Redundant Information Based on Common 
Information 

In this section, we dwell on the relationship between redundant information and the more familiar information-theoretic notions of common information. In particular, we seek to answer the following question: can optimization over a single RV yield a plausible measure of redundancy that satisfies (LP)?

A simple measure of redundant information between predictors $(X_1,X_2)$ about a target RV $Y$ is defined as follows [?].

\[ I_\cap^1(\{X_1,X_2\};Y) = \max_{Q:\ H(Q|X_1) = H(Q|X_2) = 0} I(Q;Y) = I(X_1 \wedge X_2;Y) \tag{17} \]

$I_\cap^1$ satisfies (GP), (S), (I), (M) and (TM) but not (Id) [?]. $I_\cap^1$ inherits the negative character of the original definition of GK and fails to capture any redundancy beyond a certain deterministic interdependence between the predictors. Unless $p_{X_1X_2}$ is decomposable, $I_\cap^1(\{X_1,X_2\};Y)$ is trivially zero, even if it is the case that the predictors share nontrivial redundant information about the target $Y$. Furthermore, $I_\cap^1$ violates (LP) [?] and is too restrictive in the sense that it does not capture the full informational overlap.
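
For finite alphabets, the maximization in (17) can be carried out combinatorially. The sketch below assumes the standard characterization of the maximal common RV $X_1 \wedge X_2$ as the label of the connected component of the bipartite support graph of $p_{X_1X_2}$; the function names are hypothetical.

```python
from collections import defaultdict
from math import log2


def entropy(pmf, idx):
    """Entropy (bits) of the coordinates `idx` of a pmf {outcome tuple: prob}."""
    marg = defaultdict(float)
    for outcome, prob in pmf.items():
        marg[tuple(outcome[i] for i in idx)] += prob
    return -sum(q * log2(q) for q in marg.values() if q > 0)


def common_part_labels(pairs):
    """Union-find over the bipartite support graph of (x1, x2) pairs."""
    parent = {}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in pairs:
        for node in (("x1", a), ("x2", b)):
            parent.setdefault(node, node)
        parent[find(("x1", a))] = find(("x2", b))
    return {a: find(("x1", a)) for a, _ in pairs}


def i_cap_1(pmf):
    """I(X1 ^ X2; Y) for a pmf {(x1, x2, y): prob}."""
    support = [(a, b) for (a, b, _), prob in pmf.items() if prob > 0]
    label = common_part_labels(support)
    joint = defaultdict(float)
    for (a, _, y), prob in pmf.items():
        if prob > 0:
            joint[(label[a], y)] += prob
    return entropy(joint, (0,)) + entropy(joint, (1,)) - entropy(joint, (0, 1))


and_pmf = {(a, b, a & b): 0.25 for a in (0, 1) for b in (0, 1)}
example6 = {t: 0.125 for t in [(1, 1, 1), (1, 2, 2), (2, 1, 2), (2, 2, 1),
                               (3, 3, 3), (3, 4, 4), (4, 3, 4), (4, 4, 3)]}
print(i_cap_1(and_pmf))   # indecomposable predictors: no common part
print(i_cap_1(example6))  # decomposable predictors: the shared block bit
```

The AND case shows the degeneracy described above: the predictors are indecomposable, so the measure returns zero even though the consensus redundancy for AND is positive.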

One can relax the constraint in (17) in a natural way by using the asymmetric notion of CI, $C^2(\{X_1,X_2\};Y)$, introduced earlier in (12). For consistency of naming convention, we call this $I_\cap^2$.

\[ I_\cap^2(\{X_1,X_2\};Y) = \max_{Q:\ Q - X_1 - Y,\ Q - X_2 - Y} I(Q;Y) \tag{18} \]

The definition has an intuitive appeal. If $Q$ specifies the optimal redundant RV, then conditioning on any predictor $X_i$ should remove all the redundant information about $Y$, i.e., $I(Q;Y|X_i) = 0$, $i = 1,2$ [?]. $I_\cap^2$ remedies the degenerate nature of $I_\cap^1$ with respect to indecomposable distributions [?]. It is also easy to see that the derived unique information measure, $UI^2$, is nonnegative.

\[ UI^2(\{X_1\};Y) = \min_{Q:\ Q - X_1 - Y,\ Q - X_2 - Y} I(X_1;Y|Q) \tag{19} \]

$UI^2$ readily satisfies the symmetry condition (15f) since given $Q$ such that $Q - X_1 - Y$ and $Q - X_2 - Y$, we have $I(X_1;Y) + I(X_2;Y|Q) \overset{(a)}{=} I(QX_1;Y) + I(X_2;Y|Q) = I(QX_2;Y) + I(X_1;Y|Q) \overset{(b)}{=} I(X_2;Y) + I(X_1;Y|Q)$, where (a) follows from $Q - X_1 - Y$ and (b) follows from $Q - X_2 - Y$.

Fig. 3. I-diagrams for proofs of (a) Lemma 6 and (b) Lemma 7. Denoting the I-Measure of RVs $(Q,X_1,X_2,Y)$ by $\mu^*$, the atoms on which $\mu^*$ vanishes are marked by an asterisk (see text)

For the proofs of Lemma 6 and 7 to follow, we shall use the standard facility of Information diagrams (I-diagrams) [25]. For finite RVs, there is a one-to-one correspondence between Shannon's information measures and a signed measure $\mu^*$ over sets, called the I-measure. We denote the I-Measure of RVs $(Q,X_1,X_2,Y)$ by $\mu^*$. For a RV $X$, we overload notation by using $X$ to also label the corresponding set in the I-diagram. Note that the I-diagrams in Fig. 3 are valid information diagrams since the sets $Q, X_1, X_2, Y$ intersect each other generically and the region representing the set $Q$ splits each atom into two smaller ones.

Lemma 6. If $X_1 \perp X_2$, then $I_\cap^2(\{X_1,X_2\};Y) = 0$.

Proof: The atoms on which $\mu^*$ vanishes when the Markov chains $Q - X_1 - Y$ and $Q - X_2 - Y$ hold and $X_1 \perp X_2$ are shown in the generic I-diagram in Fig. 3(a); $\mu^*(Q \cap Y) = 0$ which gives the result. ■

Lemma 7. If $X_1 - Y - X_2$, then $I_\cap^2(\{X_1,X_2\};Y) \le I(X_1;X_2)$.

Proof: The atoms on which $\mu^*$ vanishes when the Markov chains $Q - X_1 - Y$, $Q - X_2 - Y$ and $X_1 - Y - X_2$ hold are shown in the I-diagram in Fig. 3(b). In general, for the atom $X_1 \cap X_2 \cap Y$, $\mu^*$ can be negative. However, since $X_1 - Y - X_2$ is a Markov chain by assumption, we have $\mu^*(X_1 \cap X_2 \cap Y) = \mu^*(X_1 \cap X_2) \ge 0$. Then $\mu^*(Q \cap Y) \le \mu^*(X_1 \cap X_2)$, which gives the desired claim. ■

By Lemma 7, $I_\cap^2$ already violates the requirement posited in Lemma 4(d) for an ideal $I_\cap$. It turns out that we can make a more precise statement under a stricter assumption, which also amounts to proving that $I_\cap^2$ violates (Id).

Lemma 8. Let $\mathcal{Y} = \mathcal{X}_1 \times \mathcal{X}_2$ and $Y = X_1X_2$. Then $I_\cap^2(\{X_1,X_2\};Y) = C_{GK}(X_1;X_2) \le I(X_1;X_2)$.

Proof: First note that

\[ I_\cap^2(\{X_1,X_2\};X_1X_2) = \max_{Q:\ Q - X_1 - X_1X_2,\ Q - X_2 - X_1X_2} I(Q;X_1X_2) = \max_{Q:\ Q - X_1 - X_2,\ Q - X_2 - X_1} I(Q;X_1). \]

From Lemma 2 we have that, given $p_{Q|X_1X_2}$ such that $Q - X_1 - X_2$ and $Q - X_2 - X_1$, $\exists\, p_{Q'|X_1X_2}$ such that $H(Q'|X_1) = H(Q'|X_2) = 0$ and $X_1X_2 - Q' - Q$. Then $I(Q;X_1) = I(Q;X_1X_2) = I(Q;X_1X_2Q') = I(Q';Q) \le H(Q')$. $Q'$ is the maximal common RV of $X_1$ and $X_2$. Thus, we have $\max_{Q:\ Q - X_1 - X_2,\ Q - X_2 - X_1} I(Q;X_1) = H(Q') \le I(X_1;X_2)$, with equality iff $X_1 - Q' - X_2$, or equivalently, iff $(X_1,X_2)$ is saturable (see Remark 3). ■

Remark 6. Consider the Gács-Körner version of $I_\cap^2$:

\[ C_\cap^2(\{X_1,X_2\};Y) = \max_{Q:\ Q - X_1 - Y,\ Q - X_2 - Y} C_{GK}(Q;Y). \]

Interestingly, $C_\cap^2$ satisfies (Id) in the sense that $C_\cap^2(\{X_1,X_2\};X_1X_2) = C_{GK}(X_1;X_2)$, or equivalently, $\max_{Q:\ Q - X_1 - X_2,\ Q - X_2 - X_1} C_{GK}(Q;X_1X_2) = C_{GK}(X_1;X_2)$. To show this, we again use Lemma 2. Clearly the $Q$ that achieves the maximum is the maximal common RV $Q'$ so that we have, LHS $= C_{GK}(Q';X_1X_2) = C_{GK}(X_1 \wedge X_2;X_1X_2) = H(X_1 \wedge X_2 \wedge X_1X_2) = H(X_1 \wedge X_2) = C_{GK}(X_1;X_2) =$ RHS.

Proposition 1. $I_\cap^2$ satisfies (GP), (S), (I), (M), and (SM) but not (LP) and (Id).

Proof: 

(GP) Global positivity follows immediately from the nonnegativity of mutual information.

(S) Symmetry follows since $I_\cap^2$ is invariant under reordering of the $X_i$'s.

(I) If $Q - X_1 - Y$, then $I(Q;Y) \le I(X_1;Y)$. Then, self-redundancy follows from noting that $I_\cap^2(\{X_1\};Y) = \max_{Q:\ Q - X_1 - Y} I(Q;Y) = I(X_1;Y)$.
(M) We first show that $I_\cap^2(\{X_1,X_2\};Y) \le I_\cap^2(\{X_1\};Y)$. This follows immediately from noting that $\max_{Q:\ Q - X_1 - Y,\ Q - X_2 - Y} I(Q;Y) \le \max_{Q:\ Q - X_1 - Y} I(Q;Y)$, since the constraint set for the LHS is a subset of that for the RHS and the objective function for the maximization is the same on both sides.

For the equality condition, we need to show that if $H(X_1|X_2) = 0$, then $I_\cap^2(\{X_1,X_2\};Y) = I_\cap^2(\{X_1\};Y)$. It suffices to show that if $H(X_1|X_2) = 0$, then $\max_{Q:\ Q - X_1 - Y,\ Q - X_2 - Y} I(Q;Y) \ge \max_{Q:\ Q - X_1 - Y} I(Q;Y)$. This holds since if $H(X_1|X_2) = 0$, then $Q - X_1 - Y \implies Q - X_2 - Y$.

(SM) Since (M) holds, it suffices to show the equality condition. For the latter, we need to show that if $I(X_1X_2;Y) = I(X_2;Y)$ or equivalently, if $X_1 - X_2 - Y$, then $I_\cap^2(\{X_1,X_2\};Y) = I_\cap^2(\{X_1\};Y)$. This follows from noting that $I(Q;Y) \overset{(a)}{\le} I(X_1;Y) \overset{(b)}{\le} I(X_2;Y)$, where (a) follows from $Q - X_1 - Y$ and (b) follows from $X_1 - X_2 - Y$. Hence, we have $I_\cap^2(\{X_1,X_2\};Y) = \max_{Q:\ Q - X_1 - Y,\ Q - X_2 - Y} I(Q;Y) = I(X_1;Y) \overset{(c)}{=} I_\cap^2(\{X_1\};Y)$, where (c) follows from (I).

(LP) Proof by counter-example: We show that if $X_1 - Y - X_2$, then (LP) is violated. First note that if $X_1 - Y - X_2$, then using the symmetry of co-information, the derived synergy measure is $SI^2(\{X_1X_2\};Y) = I_\cap^2(\{X_1,X_2\};Y) - I(X_1;X_2)$. From Lemma 7, it follows that $I_\cap^2(\{X_1,X_2\};Y) \le I(X_1;X_2)$ so that $SI^2(\{X_1X_2\};Y) \le 0$. Hence, there exists at least one distribution such that (LP) does not hold, which suffices to say that (LP) does not hold in general.

Indeed, the COPY function in Example 7 provides a direct counterexample, since $I_\cap^2(\{X_1,X_2\};Y) = 0$ and $SI^2(\{X_1X_2\};Y) = -I(X_1;X_2) < 0$. Not surprisingly, the derived synergy measure exactly matches the deficit in mechanistic redundancy that $I_\cap^2$ fails to capture.

(Id) By Lemma 8, $I_\cap^2$ violates (Id). ■

Proposition 2. $I_\cap^2$ satisfies (PM) but not (TM).

Proof:

(TM) We need to show that if $H(Y|Z) = 0$, then $I_\cap^2(\{X_1,X_2\};Y) \le I_\cap^2(\{X_1,X_2\};Z)$, or equivalently, $\max_{Q:\ Q - X_1 - Y,\ Q - X_2 - Y} I(Q;Y) \le \max_{Q:\ Q - X_1 - Z,\ Q - X_2 - Z} I(Q;Z)$. The latter does not hold since $Q - X_i - Z \implies Q - X_i - Y$, $i = 1,2$, but the converse does not hold in general. Hence, $I_\cap^2$ violates (TM).

(PM) We need to show that if $H(X_1|X_1') = 0$, then $I_\cap^2(\{X_1,X_2\};Y) \le I_\cap^2(\{X_1',X_2\};Y)$, or equivalently, $\max_{Q:\ Q - X_1 - Y,\ Q - X_2 - Y} I(Q;Y) \le \max_{Q:\ Q - X_1' - Y,\ Q - X_2 - Y} I(Q;Y)$. The latter holds since $Q - X_1 - Y \implies Q - X_1' - Y$. Since $I_\cap^2$ is symmetrical in the $X_i$'s, $I_\cap^2$ satisfies (PM). ■

Proposition 3. $UI^2$ satisfies (TM$_u$) and (PM$_u^c$) but not (PM$_u$).

Proof:

(TM$_u$) We need to show that if $H(Y|Z) = 0$, then $UI^2_{X_2}(\{X_1\};Y) \le UI^2_{X_2}(\{X_1\};Z)$, or equivalently, $\min_{Q:\ Q - X_1 - Y,\ Q - X_2 - Y} I(X_1;Y|Q) \le \min_{Q:\ Q - X_1 - Z,\ Q - X_2 - Z} I(X_1;Z|Q)$. The latter holds since $Q - X_i - Z \implies Q - X_i - Y$, $i = 1,2$ and $I(X_1;Y|Q) \le I(X_1;Z|Q)$.

(PM$_u^c$) We need to show that if $H(X_2|X_2') = 0$, then $UI^2_{X_2}(\{X_1\};Y) \ge UI^2_{X_2'}(\{X_1\};Y)$, or equivalently, $\min_{Q:\ Q - X_1 - Y,\ Q - X_2 - Y} I(X_1;Y|Q) \ge \min_{Q:\ Q - X_1 - Y,\ Q - X_2' - Y} I(X_1;Y|Q)$. The latter holds since $Q - X_2 - Y \implies Q - X_2' - Y$.

(PM$_u$) We need to show that if $H(X_1|X_1') = 0$, then $UI^2_{X_2}(\{X_1\};Y) \le UI^2_{X_2}(\{X_1'\};Y)$, or equivalently, $\min_{Q:\ Q - X_1 - Y,\ Q - X_2 - Y} I(X_1;Y|Q) \le \min_{Q:\ Q - X_1' - Y,\ Q - X_2 - Y} I(X_1';Y|Q)$. The latter does not hold since $Q - X_1 - Y \implies Q - X_1' - Y$, but the converse does not hold in general. ■

B. Comparison with Existing Measures 

For the $K = 2$ case, it is sufficient to specify any one of the functions $I_\cap$, $UI$ or $SI$ to determine a unique decomposition of $I(X_1X_2;Y)$ (see (15a)). Information-geometric arguments have been forwarded in [?] to quantify redundancy. We do not repeat all the definitions in [?]. However, for the sake of exposition, we prefer working with the unique information since geometrically, the latter shares some similarities with the mutual information which can be interpreted as a weighted distance.

\[ I(X_1;Y) = \sum_{x_1 \in \mathcal{X}_1} p_{X_1}(x_1)\, D(p_{Y|X_1=x_1} \,\|\, p_Y). \]
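
The weighted-distance reading of mutual information is easy to confirm numerically. The sketch below uses a hypothetical function name and evaluates the sum of KL divergences for the $(X_1, Y)$ marginal of the AND distribution from Example 9.

```python
from collections import defaultdict
from math import log2


def mi_as_weighted_kl(p_xy):
    """I(X;Y) as sum_x p(x) * D(p_{Y|X=x} || p_Y), in bits."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), prob in p_xy.items():
        p_x[x] += prob
        p_y[y] += prob
    total = 0.0
    for x, px in p_x.items():
        # KL divergence between the conditional p(y|x) and the marginal p(y).
        kl = sum((p / px) * log2((p / px) / p_y[y])
                 for (xx, y), p in p_xy.items() if xx == x and p > 0)
        total += px * kl
    return total


# (X1, Y) marginal of AND: X1 = 0 forces Y = 0, X1 = 1 leaves Y uniform.
and_x1_y = {(0, 0): 0.5, (1, 0): 0.25, (1, 1): 0.25}
print(mi_as_weighted_kl(and_x1_y), 0.75 * log2(4 / 3))
```

Both printed values agree, matching the $+.311$ quoted for $I(X_1;Y)$ in the AND example.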


Given a measurement of a predictor, say $X_1 = x_1$, unique information is defined in terms of the reverse information projection [?] of $p_{Y|X_1=x_1}$ on the convex closure of the set of all conditional distributions of $Y$ for all possible outcomes of $X_2$.

\[ UI^3(\{X_1\};Y) = \sum_{x_1 \in \mathcal{X}_1} p_{X_1}(x_1) \min_{Q \in \Delta} D(p_{Y|X_1=x_1} \,\|\, Q), \tag{20} \]

where $\Delta$ is the convex hull of $\langle p_{Y|X_2=x_2} \rangle_{x_2 \in \mathcal{X}_2}$, the family of all conditional distributions of $Y$ given the different outcomes of $X_2$. Since the KL divergence is convex with respect to both its arguments, the minimization in (20) is well-defined. It is easy to see however, that $UI^3$ violates the symmetry condition (15f) unless the projection is guaranteed to be unique. Uniqueness is guaranteed only when the set we are projecting onto is log-convex [?]. In particular, (20) only gives a lower bound on the unique information so that we have,

\[ UI^3(\{X_1\};Y) \ge \sum_{x_1 \in \mathcal{X}_1} p_{X_1}(x_1) \min_{Q \in \Delta} D(p_{Y|X_1=x_1} \,\|\, Q). \]

The symmetry is restored by considering the minimum of the projected information terms for the derived redundant information [?].

\[ I_\cap^3(\{X_1,X_2\};Y) = \min\bigl[ I(X_1;Y) - UI^3(\{X_1\};Y),\ I(X_2;Y) - UI^3(\{X_2\};Y) \bigr]. \tag{21} \]


$I_\cap^3$ satisfies (GP), (S), (I), (M), (LP) and (Id) but not (TM) [?].

The following measure of unique information is proposed in [?].




ui^{{x,y,Y) 


max 

{X[,X2,Y'}- Px[Y'=PXiY 

Px^Y'=PX2Y 




( 22 ) 


The derived redundant information is $I_\cap^4(\{X_1,X_2\};Y) = I(X_1;Y) - UI^4(\{X_1\};Y)$. $I_\cap^4$ satisfies (GP), (S), (I), (M), (LP) and (Id) but not (TM) [?].

Proposition 4 shows that both $I_\cap^3$ and $I_\cap^4$ satisfy (SM).

Proposition 4. $I_\cap^3$ and $I_\cap^4$ satisfy (SM).

Proof: See Lemma 13 and Corollary 23 in [?]. ■

It is easy to show that $I_\cap^1$ violates (SM) (see Example ImperfectRdn in [?]). Table I lists the desired properties satisfied by $I_\cap^1$, $I_\cap^2$, $I_\cap^3$ and $I_\cap^4$.

The following proposition from 0 gives the conditions
under which $I_\cap^3$ and $I_\cap^4$ vanish.


Proposition 5. If both $X_1 - Y - X_2$ and $X_1 \perp X_2$ hold, then
$I_\cap^3(\{X_1,X_2\};Y) = I_\cap^4(\{X_1,X_2\};Y) = 0$.

Proof: See Corollary 10 and Lemma 21 in 0. ■ 


In general, the conditions for which an ideal $I_\cap$ vanishes
are given in Lemma 5(a). Indeed, if both $X_1 - Y - X_2$ and
$X_1 \perp X_2$ hold, then from (15g) we have that
$I_\cap(\{X_1,X_2\};Y) - SI(\{X_1X_2\};Y) = 0$, so that $I_\cap \ge 0$ in
general (also see Lemma 4(d)). However, we have not been
able to produce a counterexample to refute Proposition 5 (for
an ideal $I_\cap$). We conjecture that the conditions $X_1 - Y - X_2$
and $X_1 \perp X_2$ are ideally sufficient for a vanishing $I_\cap$.


TABLE I. Desired properties of $I_\cap$ satisfied by the CI-based
measures $I_\cap^1$ and $I_\cap^2$, and the earlier measures
$I_\cap^3$ and $I_\cap^4$

Property                      I_∩^1   I_∩^2   I_∩^3   I_∩^4
(GP)  Global Positivity         ✓       ✓       ✓       ✓
(S)   Weak Symmetry             ✓       ✓       ✓       ✓
(I)   Self-redundancy           ✓       ✓       ✓       ✓
(M)   Weak Monotonicity         ✓       ✓       ✓       ✓
(SM)  Strong Monotonicity               ✓       ✓       ✓
(LP)  Local Positivity                          ✓       ✓
(Id)  Identity                                  ✓       ✓


Proposition 5 highlights a key difference between the CI-based
measure of redundancy and the related measures $I_\cap^3$ and
$I_\cap^4$. By Lemma 6, the CI-based measure vanishes if
$X_1 \perp X_2$. Clearly, unlike $I_\cap^3$ and $I_\cap^4$, it is not
sensitive to the extra Markov condition $X_1 - Y - X_2$. This
is most clearly evident for the AND function in Example 9,
where we have $I(X_1;X_2) = 0$ and $I(X_1;X_2|Y) = +0.189$.
Lemma 6 dictates that the CI-based measure is zero if
$X_1 \perp X_2$, and the ensuing decomposition is degenerate. Thus,
for independent predictor RVs, if $Y$ is a function of $X_1$ and
$X_2$ when any positive redundancy can be attributed solely to
the common effect $Y$, the CI-based measure fails to capture the
required decomposition (see Remark 5). When the predictor RVs
are not independent, a related degeneracy is associated with
the violation of (Id) when the CI-based measure fails to attain
the mutual information between the predictor RVs (see the COPY
function in Example 7). Indeed, by Lemma 8, for the CI-based
measure, $I_\cap(\{X_1,X_2\};Y) = I(X_1;X_2)$ iff $p_{X_1X_2}$ is
saturable. Also by Lemma 7, the CI-based measure violates the
requirement posited in Lemma 4(d) which generalizes the (Id)
property. Interestingly, Lemma 6 also shows that any reasonable
measure of redundant information cannot be derived by
optimization over a single RV.
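The two quantities quoted for the AND function can be reproduced directly from the joint pmf. The sketch below assumes independent uniform input bits, which is consistent with the stated values but is not spelled out in this section; the helper names are illustrative.

```python
from collections import defaultdict
from itertools import product
from math import log2

def marginal(joint, idx):
    """Marginal pmf over the coordinates listed in idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out

def H(joint, idx):
    """Entropy in bits of the coordinates listed in idx."""
    return -sum(p * log2(p) for p in marginal(joint, idx).values() if p > 0)

# Joint pmf over (x1, x2, y) with y = AND(x1, x2); uniform inputs are assumed.
joint = {(x1, x2, x1 & x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}

# I(X1;X2) = H(X1) + H(X2) - H(X1,X2)
mi_12 = H(joint, (0,)) + H(joint, (1,)) - H(joint, (0, 1))
# I(X1;X2|Y) = H(X1,Y) + H(X2,Y) - H(Y) - H(X1,X2,Y)
cmi_12_given_y = (H(joint, (0, 2)) + H(joint, (1, 2))
                  - H(joint, (2,)) - H(joint, (0, 1, 2)))

print(round(mi_12, 3), round(cmi_12_given_y, 3))  # 0.0 0.189
```

The predictors are independent, yet conditioning on the common effect $Y$ induces a dependence of about 0.189 bits between them.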

We give one final example which elucidates the subtlety of
the PI decomposition problem from a coding-theoretic point
of view. Consider the distribution in Example 11, where
$Y = \mathrm{COPY}(X_1,X_2)$. The PI decomposition of $I(X_1X_2;Y)$ in
this case reduces to the decomposition of $H(X_1X_2)$ into redundant
and unique information contributions.¹

Example 11. Let $\mathcal{X}_1 = \{1,2\}$ and $\mathcal{X}_2 = \{3,4,5,6\}$.
Let $Y = X_1X_2$ with $\mathcal{Y} = \mathcal{X}_1 \times \mathcal{X}_2$.
Consider the distribution $p_{X_1X_2Y}$ with
(13“13”) = (14“14”) = (15“15”) = (16“16”) = $\frac{1}{8}$,
(23“23”) = $\frac{1}{8} - \delta$, (24“24”) = $\frac{1}{8} + \delta$,
(25“25”) = $\frac{1}{8} + \delta'$, (26“26”) = $\frac{1}{8} - \delta'$,
where $\delta, \delta' < \frac{1}{8}$ and (ab“ab”) := $p_{X_1X_2Y}(a,b,ab)$.
If $\delta \neq \delta'$, as $\delta, \delta' \to 0$, we have the
following ideal PI decomposition:
$I_\cap(\{X_1,X_2\};(X_1,X_2)) = I(X_1;X_2) \to 0$,
$SI(\{X_1X_2\};Y) = 0$, $UI(\{X_1\};Y) = H(X_1|X_2) \to +1.0$ and
$UI(\{X_2\};Y) = H(X_2|X_1) \to +2.0$.
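The limiting values in this example can be checked numerically. The sketch below assumes a base probability of 1/8 for every outcome, which is the value that makes the eight masses sum to one and matches the stated limits; the function names and the particular perturbation sizes are illustrative choices.

```python
from collections import defaultdict
from math import log2

def example_pmf(delta, delta_prime):
    """Joint pmf p(x1, x2); Y copies (X1, X2), so it carries no extra randomness.
    The base mass 1/8 is an assumption reconstructed from the stated limits."""
    base = 1 / 8
    p = {(1, x2): base for x2 in (3, 4, 5, 6)}
    p[(2, 3)] = base - delta
    p[(2, 4)] = base + delta
    p[(2, 5)] = base + delta_prime
    p[(2, 6)] = base - delta_prime
    return p

def entropy(pmf):
    return -sum(v * log2(v) for v in pmf.values() if v > 0)

def marginal(pmf, axis):
    out = defaultdict(float)
    for outcome, v in pmf.items():
        out[outcome[axis]] += v
    return out

p = example_pmf(1e-3, 2e-3)
p1, p2 = marginal(p, 0), marginal(p, 1)
h1_given_2 = entropy(p) - entropy(p2)           # H(X1|X2)
h2_given_1 = entropy(p) - entropy(p1)           # H(X2|X1)
mi = entropy(p1) + entropy(p2) - entropy(p)     # I(X1;X2)

# Conditional pmfs p(X1 = 1 | x2); they differ for every x2 when delta != delta'.
cond = {x2: p[(1, x2)] / p2[x2] for x2 in p2}
print(h1_given_2, h2_given_1, mi, len(set(cond.values())))
```

With small unequal perturbations the mutual information is close to zero while all four conditional distributions of $X_1$ given $X_2$ remain distinct, which is the feature the coding-theoretic discussion relies on.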

Consider again the source network with coded side information
setup ll^ where predictors $X_1$ and $X_2$ are independently
encoded and a joint decoder wishes to losslessly reconstruct
only $X_1$, using the coded $X_2$ as side information. It is tempting
to assume that a complete description of $X_1$ is always possible
by coding the side information at a rate $R_{X_2} = I(X_1;X_2)$
and describing the remaining uncertainty about $X_1$ at rate
$R_{X_1} = H(X_1|X_2)$. Example 11 provides an interesting
counterexample to this intuition. Since the conditional
distributions $p_{X_1|X_2}(\cdot|x_2)$ are different for all
$x_2 \in \mathcal{X}_2$, we have $R_{X_2} \ge H(X_2)$ (see Theorem 2).
Consequently one needs to fully describe $X_2$ to (losslessly)
recover $X_1$, even if it is the case that $I(X_1;X_2)$ is
arbitrarily small. Therefore, separating the redundant and unique
information contributions from $X_2$ is not possible in this case.

¹ Also see Example 5.

C. Conclusions 

We first took a closer look at the varied relationships
between two RVs. Assuming information is embodied in
$\sigma$-algebras and sample space partitions, we formalized the
notions of common and private information structures. We
explored the subtleties involved in decomposing $H(XY)$ into
common and private parts. The richness of the information
decomposition problem is already manifest in this simple case
in which common and private informational parts sometimes
cannot be isolated. We also inquired if a nonnegative PI
decomposition of the total mutual information can be achieved
using a measure of redundancy based on common information.
We answered this question in the negative. In particular, we
showed that for independent predictor RVs when any nonvanishing
redundancy can be attributed solely to a mechanistic
dependence between the target and the predictors, any common
information based measure of redundancy cannot induce a
nonnegative PI decomposition.

Existing measures of synergistic [6] and unique [8] information
use optimization over three auxiliary RVs to achieve
a nonnegative decomposition. We leave as an open question
if optimization over two auxiliary RVs can achieve a similar
feat. Also, at present it is not clear if the coding-theoretic
interpretation leading up to the counterexample in Example 11
calls into question the bivariate PI decomposition framework
itself. More work is needed to assess its implications on the
definitions of redundant and unique information.

In closing, we mention two other candidate decompositions
of the total mutual information. Pola et al. proposed
a decomposition of the total mutual information between the
target and the predictors into terms that account for different
coding modalities [58]. Some of the terms can, however,
exceed the total mutual information m-. Consequently, the
decomposition is not nonnegative, thus severely limiting the
operational interpretation of the different coding components.
More recently, a decomposition of the total mutual information
was proposed in im based on a notion of synergistic
information, using maximum entropy projections on $k$-th
order interaction spaces im, nn). The ensuing decomposition
is, however, incompatible with (LP) ifTTI . Like 7^, 5'^^^ is 
symmetric with respect to permutations of the target and the 
predictor RVs which strongly hints that fails to capture 
any notion of mechanistic dependence. Indeed, for the And 
example, computes to zero, and consequently (LP) is 
violated. 

In general, the quest for an operationally justified nonnegative
decomposition of multivariate information remains
an open problem. Finally, given the subtle nature of the
decomposition problem, intuition is not the best guide.
IV. Appendices

A. Appendix A: Supplemental proofs omitted in Section II 

Proof of Lemma 2: (See Problem 16.25, p. 392 in
[24]; also see Corollary 1 in ED). Given $p_{Q|XY}$ such that
$X - Y - Q$ and $Y - X - Q$, it follows that $p_{XY}(x,y) > 0
\implies p_{Q|XY}(q|x,y) = p_{Q|X}(q|x) = p_{Q|Y}(q|y)$ $\forall q$. Given
an ergodic decomposition of $p_{XY}(x,y)$ such that
$\mathcal{X} \times \mathcal{Y} = \bigcup_{q'} \mathcal{X}_{q'} \times \mathcal{Y}_{q'}$,
where the $\mathcal{X}_{q'}$'s and $\mathcal{Y}_{q'}$'s having different
subscripts are disjoint, define $p_{Q'|XY}$ as
$Q' = q' \iff x \in \mathcal{X}_{q'} \iff y \in \mathcal{Y}_{q'}$.
Clearly $H(Q'|X) = H(Q'|Y) = 0$.
Then, for any $Q = q$ and for every $q'$, $p_{Q|XY}(q|x,y)$ is
constant over $\mathcal{X}_{q'} \times \mathcal{Y}_{q'}$, which implies that
$p_{Q|XY}(q|x,y) = p_{Q|Q'}(q|q')$. Thus, for any $q'$ for which
$p_{Q'}(q') > 0$,
$$p_{XYQ|Q'}(x,y,q|q') = p_{Q|XYQ'}(q|x,y,q')\, p_{XY|Q'}(x,y|q') = p_{Q|XY}(q|x,y)\, p_{XY|Q'}(x,y|q') = p_{Q|Q'}(q|q')\, p_{XY|Q'}(x,y|q'),$$
so that $XY - Q' - Q$. The converse is obvious. Thus, given (2),
we get $Q'$ such that $I(XY;Q|Q') = 0$ so that $I(XY;Q) =
I(XYQ';Q) = I(Q';Q) = H(Q') - H(Q'|Q) \le H(Q')$. ■
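The ergodic decomposition used in this proof can be computed by finding the connected components of the bipartite graph that joins $x$ and $y$ whenever $p_{XY}(x,y) > 0$. The sketch below uses a small union-find for this; it is an illustration under that standard graph-theoretic reading, and the function names and the two example pmfs are assumptions made here.

```python
from math import log2

def component_masses(joint):
    """Probability mass of each connected component of the support graph.
    joint maps (x, y) -> p(x, y); x and y are linked whenever p(x, y) > 0."""
    parent = {}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for (x, y), p in joint.items():
        if p > 0:
            for node in (("x", x), ("y", y)):
                parent.setdefault(node, node)
            parent[find(("x", x))] = find(("y", y))

    masses = {}
    for (x, y), p in joint.items():
        if p > 0:
            root = find(("x", x))
            masses[root] = masses.get(root, 0.0) + p
    return list(masses.values())

def entropy(masses):
    return -sum(m * log2(m) for m in masses if m > 0)

# Two blocks: {x=0, y in {0,1}} and {x=1, y=2}; Q' labels the block.
block = {(0, 0): 0.25, (0, 1): 0.25, (1, 2): 0.5}
# Full support: a single component, so Q' is constant.
full = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

print(entropy(component_masses(block)), abs(entropy(component_masses(full))))
```

The component label is a function of $X$ alone and of $Y$ alone, so $H(Q'|X) = H(Q'|Y) = 0$, and its entropy is the quantity $H(Q')$ that upper-bounds $I(XY;Q)$ in the last line of the proof.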

Lemma A1. $p_{XY}$ is saturable iff there exists a pmf $p_{Q|XY}$
such that $X - Q - Y$, $Q - X - Y$, $Q - Y - X$.

Proof: Given $Q$: $X - Y - Q$, $Y - X - Q$, by Lemma 2,
there exists a pmf $p_{Q'|XY}$ such that $H(Q'|X) = H(Q'|Y) = 0$
and $XY - Q' - Q$. Clearly, $I(X;Y|Q) = 0 \implies I(X;Y|Q') = 0$
since $I(X;Y|Q) = I(XQ';Y|Q) \ge I(X;Y|QQ') =
I(X;Y|Q')$, where the last equality follows from $XY - Q' - Q$.
Taking $Q'$ as $Q_*$, the claim follows. Taking $Q_*$ as $Q$, the other
direction is obvious. ■

Lemma A2. $C_{GK}(X;Y) = I(X;Y) \iff I(X;Y) = C_W(X;Y)$
(see Problem 16.30, p. 395 in iH).

Proof: Let RV $Q_1$ achieve the minimization in
(4). Note the following chain of equivalences ll2^ :
$$I(XY;Q_1) = I(X;Y) \iff H(XY|Q_1) = H(X|Y) + H(Y|X) \overset{(a)}{\iff} H(X|Q_1Y) + H(Y|Q_1X) = H(X|Y) + H(Y|X) \iff Q_1 - X - Y,\ Q_1 - Y - X,$$
where (a) follows from $X - Q_1 - Y$.
The claim follows then from invoking Lemma 2 and noting
that $I(XY;Q_1) = H(Q_*) = C_{GK}(X;Y)$, where $Q_*$ is the
maximal common RV. ■

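Lemma A3. $C_{GK}(X_1;\ldots;X_K)$ is monotonically nonincreasing
in $K$, whereas $C_W(X_1;\ldots;X_K)$ is monotonically nondecreasing
in $K$. Also $C_{GK}(X_1;\ldots;X_K) \le \min_{i \ne j} I(X_i;X_j)$, while
$C_W(X_1;\ldots;X_K) \ge \max_{i \ne j} I(X_i;X_j)$, for any
$i,j \in \{1,\ldots,K\}$.

Proof: Let $X_A = (X_i)_{i \in A}$ be a $K$-tuple of RVs ranging
over finite sets $\mathcal{X}_i$, where $A$ is an index set of size $K$, and
let $\mathcal{P}_{X_A}$ be the set of all conditional pmfs $p_{Q|X_A}$ s.t.
$|\mathcal{Q}| \le \prod_i |\mathcal{X}_i| + 2$. First note the following easy
extensions.

$$C_{GK}(X_1;\ldots;X_K) = \max_{Q:\ Q - X_i - X_{A \setminus i},\ \forall i \in A} I(X_A;Q)$$

$$C_W(X_1;\ldots;X_K) = \min_{Q:\ X_i - Q - X_j,\ \forall i,j \in A,\ i \ne j} I(X_A;Q)$$

Given $p_{Q|X_A} \in \mathcal{P}_{X_A}$ such that (a) $Q - X_i - X_{A \setminus i}$,
$\forall i \in A$, we have

$$I(X_{A \setminus K};Q) \overset{(b)}{=} I(X_{A \setminus K};Q) + I(X_{A \setminus 1};Q|X_1) \overset{(c)}{\ge} I(X_A;Q),$$

where (b) follows from using $i = 1$ in (a), and (c) follows
from noting that $I(X_{A \setminus 1};Q|X_1) \ge I(X_K;Q|X_{A \setminus K})$. We then
have

$$\max_{Q:\ Q - X_i - X_{A \setminus i},\ \forall i \in A} I(X_A;Q) \le \max_{Q:\ Q - X_i - X_{A \setminus i},\ \forall i \in A} I(X_{A \setminus K};Q) \overset{(d)}{\le} \max_{Q:\ Q - X_i - X_{A \setminus \{i,K\}},\ \forall i \in A \setminus K} I(X_{A \setminus K};Q),$$

where (d) follows since $\forall i \in A$, $Q - X_i - X_{A \setminus i}$ implies
$Q - X_i - X_{A \setminus \{i,K\}}$, $\forall i \in A \setminus K$. Hence,
$C_{GK}(X_1;\ldots;X_K) \le C_{GK}(X_1;\ldots;X_{K-1})$. Also note that for
any $i,j \in A$,

$$I(X_A;Q) \overset{(e)}{=} I(X_i;Q) \overset{(f)}{\le} I(X_i;X_j),$$

where (e) follows from (a)
and (f) follows from invoking the data processing inequality
after using (a) again, since for any $j \in A$, $Q - X_j - X_{A \setminus j}
\implies Q - X_j - X_i$, with $i \in A \setminus j$. Hence
$C_{GK}(X_1;\ldots;X_K) \le \min_{i \ne j} I(X_i;X_j)$.

The claim for monotonicity of the Wyner CI is immediate
from noting that

$$\min_{Q:\ X_i - Q - X_j,\ \forall i,j \in A \setminus K,\ i \ne j} I(X_{A \setminus K};Q) \le \min_{Q:\ X_i - Q - X_j,\ \forall i,j \in A,\ i \ne j} I(X_A;Q),$$

since the constraint set for $Q$ in the RHS is a subset of that
for the LHS. Further, for any $i,j \in A$,
$X_i - Q - X_j \implies I(X_i;X_j) \le I(X_i;Q) \le I(X_A;Q)$,
whence $\max_{i \ne j} I(X_i;X_j) \le C_W(X_1;\ldots;X_K)$ follows.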


Proof of Lemma 3: Let $Q_Y = Q_1$. Clearly, $Q_1 - Y - X$
is a Markov chain. $Y - Q_1 - X$ is also a Markov chain
since given $Q_1 = q_1$,
$$p_{X|Q_1}(x|q_1) = \sum_{y \in \mathcal{Y}} p_{XY|Q_1}(x,y|q_1) = \sum_{y:\, Q_1 = q_1} p_{Y|Q_1}(y|q_1)\, p_{X|YQ_1}(x|y,q_1) = p_{X|YQ_1}(x|y,q_1),$$
$\forall y$ given $Q_1 = q_1$. Now let $Q_2 = g(Y)$ so that $X - Q_2 - Y$.
For some $y,y' \in \mathcal{Y}$, let $g(y) = g(y') = q_2$. Then,
$p_{X|Q_2}(x|q_2) = p_{X|Y}(x|y) = p_{X|Y}(x|y')$, $x \in \mathcal{X}$. Thus
$f(y) = f(y')$, which implies $X - Q_1 - Q_2 - Y$. Hence $Q_1 = Q_Y$ is a
minimal sufficient statistic of $Y$ with respect to $X$. ■
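The construction behind this proof can be illustrated by grouping the outcomes $y$ that induce the same conditional pmf $p_{X|Y}(\cdot|y)$; the group label then plays the role of $Q_Y$. The sketch below is a minimal illustration with exact rational arithmetic so that equal conditionals compare equal; the function names and the toy pmf are assumptions made here, not taken from the paper.

```python
from collections import defaultdict
from fractions import Fraction
from math import log2

def sufficient_partition(joint):
    """Group outcomes y by the conditional pmf p_{X|Y}(.|y) they induce.
    joint maps (x, y) -> Fraction. Returns (classes, p_y), where classes
    maps each conditional pmf (as a tuple over sorted x) to a list of y."""
    xs = sorted({x for x, _ in joint})
    p_y = defaultdict(Fraction)
    for (_, y), p in joint.items():
        p_y[y] += p
    classes = defaultdict(list)
    for y in sorted(p_y):
        if p_y[y] > 0:
            cond = tuple(joint.get((x, y), Fraction(0)) / p_y[y] for x in xs)
            classes[cond].append(y)
    return classes, p_y

def entropy(masses):
    return -sum(float(m) * log2(float(m)) for m in masses if m > 0)

F = Fraction
# Toy pmf: y = 0 and y = 1 induce the same conditional on X; y = 2 differs.
joint = {(0, 0): F(1, 8), (1, 0): F(1, 8),
         (0, 1): F(1, 8), (1, 1): F(1, 8),
         (0, 2): F(1, 2)}

classes, p_y = sufficient_partition(joint)
h_q = entropy(sum(p_y[y] for y in ys) for ys in classes.values())
h_y = entropy(p_y.values())
print(sorted(classes.values()), h_q, h_y)  # [[0, 1], [2]] 1.0 1.5
```

Here the statistic merges the two outcomes of $Y$ that are indistinguishable as far as $X$ is concerned, so $H(Q_Y) = 1$ bit while $H(Y) = 1.5$ bits. When every $y$ induces a different conditional, no merging is possible and $H(Q_Y) = H(Y)$.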

Proof of Theorem 1: From (7) and Lemma 3, it follows
that $Q_Y$ is the minimizer in $C_W(Y \setminus X)$. Since $Q_Y$
is a minimal sufficient statistic of $Y$ with respect to $X$,
for any $Q$ s.t. $Q - X - Y$ and $X - Y - Q$, it follows
that $H(Q_Y|Q) = 0$ (also see Lemma 3.4(5) in [40]). Thus,
$Q_Y$ achieves the maximum in $P_W(Y \setminus X)$. The decomposition
$H(Y) = H(Y|Q_Y) + H(Q_Y)$ easily follows. Finally,
$\min I(XY;Q) \le \min H(Q)$, because if
$X - Y - Q$ then $I(XY;Q) = I(Y;Q) \le H(Q)$, so that
$C_W(X;Y) \le C_W(Y \setminus X)$. ■

Proof of Theorem 2: When $p_{XY}$ lacks the structure to
form components of size greater than one, it follows from
Lemma 3 that $C_W(Y \setminus X) = H(Y)$. Consequently by Theorem
1, $P_W(Y \setminus X) = 0$. For the other direction, see Corollary 3 in
1^, where analogous bounds for $C_W(Y \setminus X)$ are given.

For the second part, we prove the first equivalence. Let
$Q_* = X \wedge Y = g_X(X) = g_Y(Y)$. For $x \in \mathcal{X}$, if
$y,y' \in \mathcal{Y}$ do not induce different conditional distributions
on $X$, i.e., if $p_{X|Y}(x|y) = p_{X|Y}(x|y')$, then we must have
$g_Y(y) = g_Y(y')$. This implies the existence of a function
$f$ such that $Q_* = f(Q_Y)$. If $p_{XY}$ is saturable, we also
have $X - Q_* - Y$. From Lemma 3 and Remark 3, it then
follows that $H(Q_*) = H(Q_Y) = I(X;Y)$ and consequently
$P_W(Y \setminus X) = H(Y|X)$. For the converse, first note that
$H(Q_*) \le I(X;Y) \le C_W(X;Y) \le H(Q_Y)$ (see Remark 3
and (8d)). Demanding $P_W(Y \setminus X) = H(Y|X)$ or equivalently
$H(Q_Y) = I(X;Y)$ implies $I(X;Y) = C_W(X;Y) = H(Q_Y)$,
when from Remark 3 it follows that $p_{XY}$ is saturable. For the
second equivalence, see Theorem 4 in [39]. This concludes the
proof. ■


B. Appendix B 

We briefly provide several examples and applications, 
where information-theoretic notions of synergy and redun¬ 
dancy are deemed useful. 

Synergistic and redundant information. Synergistic interactions
in the brain are observed at different levels of description.
At the level of brain regions, cross-modal illusions offer a
powerful window into how the brain integrates information
streams emanating from multiple sensory modalities [63]. A
classic example of synergistic interaction between the visual
and auditory channels is the McGurk illusion 1^ . Conflicting
voice and lip-movement cues can produce a percept that differs
in both magnitude and quality from the sum of the two
converging stimuli. At the single neuronal level, temporally
and spatially coincident multimodal cues can increase the
firing rate of individual multisensory neurons of the superior
colliculus beyond what can be predicted by summing the
unimodal responses [65]. In the context of neural coding, a
pair of spikes closely spaced in time can jointly convey more
than twice the information carried by a single spike [49]. In
cortex studies, evidence of weak synergy has been found
in the somatosensory [58], motor l5^ and primary visual
cortex [59]. Similarly, there are several studies evidencing
net redundancy at the neuronal population level ESI, ED,
[55]-[59], 1^ . Often studies on the same model system
have reached somewhat disparate conclusions. For instance,
retinal population codes have been found to be approximately
independent IMl, synergistic IMl, or redundant [62].

Unique information. A wealth of evidence suggests that
attributes such as color, motion and depth are encoded uniquely
in perceptually separable channels in the primate visual system
[66], ISS). The failure to perceive apparent motion with
isoluminant colored stimuli, dubbed the color-motion illusion
[66], demonstrates that the color and motion pathways provide
unique information with respect to each other. There is also
mounting evidence in favor of two separate visual subsystems
ED that encode the allocentric (vision for perception) and
egocentric (vision for action) coordinates uniquely along the
ventral and the dorsal pathways, respectively, for object
identification and sensorimotor transformations.

In embodied approaches to cognition, an agent’s physical
interactions with the environment generate structured
information and redundancies across multiple sensory modalities
that facilitate cross-modal associations, learning and
exploratory behavior [68]. More recent work has focused
on information decomposition in the sensorimotor loop to
quantify morphological computation, which is the contribution
of an agent’s morphology and environment to its behavior [69].
Some related decompositions have also focused on extracting
system-environment boundaries supporting biological
autonomy [70].

Further motivating examples for studying information
decomposition in general abound in cryptography na,
distributed control IT^ and adversarial settings like game theory
03 , where notions of common knowledge shared between
agents are used to describe epistemic states.